AllLife Bank is a US bank that has a growing customer base. The majority of these customers are liability customers (depositors) with varying sizes of deposits. The number of customers who are also borrowers (asset customers) is quite small, and the bank is interested in expanding this base rapidly to bring in more loan business and in the process, earn more through the interest on loans. In particular, the management wants to explore ways of converting its liability customers to personal loan customers (while retaining them as depositors).
A campaign that the bank ran last year for liability customers achieved a healthy conversion rate of over 9%. This has encouraged the retail marketing department to devise better-targeted campaigns to increase the success ratio.
As a data scientist at AllLife Bank, you have to build a model that helps the marketing department identify the potential customers who have a higher probability of purchasing the loan.
The objective is to predict whether a liability customer will buy a personal loan, to understand which customer attributes are most significant in driving purchases, and to identify which customer segments to target more.
- ID: Customer ID
- Age: Customer's age in completed years
- Experience: Years of professional experience
- Income: Annual income of the customer (in thousand dollars)
- ZIP Code: Home address ZIP code
- Family: Family size of the customer
- CCAvg: Average spending on credit cards per month (in thousand dollars)
- Education: Education level. 1: Undergrad; 2: Graduate; 3: Advanced/Professional
- Mortgage: Value of house mortgage, if any (in thousand dollars)
- Personal_Loan: Did this customer accept the personal loan offered in the last campaign? (0: No, 1: Yes)
- Securities_Account: Does the customer have a securities account with the bank? (0: No, 1: Yes)
- CD_Account: Does the customer have a certificate of deposit (CD) account with the bank? (0: No, 1: Yes)
- Online: Does the customer use internet banking facilities? (0: No, 1: Yes)
- CreditCard: Does the customer use a credit card issued by any other bank (excluding AllLife Bank)? (0: No, 1: Yes)

# this will help in making the Python code more structured automatically (good coding practice)
#load_ext nb_black
import warnings
warnings.filterwarnings("ignore")
from statsmodels.tools.sm_exceptions import ConvergenceWarning
warnings.simplefilter("ignore", ConvergenceWarning)
# Libraries to help with reading and manipulating data
import pandas as pd
import numpy as np
# Library to split data
from sklearn.model_selection import train_test_split
# Metrics module and NumPy helpers used later for threshold selection
from sklearn import metrics
from numpy import sqrt
from numpy import argmax
# Libraries to help with data visualization
import matplotlib.pyplot as plt
import seaborn as sns
import seaborn.objects as so
# Removes the limit for the number of displayed columns
pd.set_option("display.max_columns", None)
# Sets the limit for the number of displayed rows
pd.set_option("display.max_rows", 200)
# To build model for prediction
import statsmodels.stats.api as sms
from statsmodels.stats.outliers_influence import variance_inflation_factor
import statsmodels.api as sm
from statsmodels.tools.tools import add_constant
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
from sklearn.model_selection import GridSearchCV
# To get different metric scores
from sklearn.metrics import (
f1_score,
accuracy_score,
recall_score,
precision_score,
confusion_matrix,
roc_auc_score,
precision_recall_curve,
roc_curve,
)
import time
from uszipcode import SearchEngine
# Load the data into a data frame
Loan_Data = pd.read_csv("Loan_Modelling.csv")
# Review columns and data types
Loan_Data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 14 columns):
 #   Column              Non-Null Count  Dtype
---  ------              --------------  -----
 0   ID                  5000 non-null   int64
 1   Age                 5000 non-null   int64
 2   Experience          5000 non-null   int64
 3   Income              5000 non-null   int64
 4   ZIPCode             5000 non-null   int64
 5   Family              5000 non-null   int64
 6   CCAvg               5000 non-null   float64
 7   Education           5000 non-null   int64
 8   Mortgage            5000 non-null   int64
 9   Personal_Loan       5000 non-null   int64
 10  Securities_Account  5000 non-null   int64
 11  CD_Account          5000 non-null   int64
 12  Online              5000 non-null   int64
 13  CreditCard          5000 non-null   int64
dtypes: float64(1), int64(13)
memory usage: 547.0 KB
# Generate descriptive stats on the data set
Loan_Data.describe()
| | ID | Age | Experience | Income | ZIPCode | Family | CCAvg | Education | Mortgage | Personal_Loan | Securities_Account | CD_Account | Online | CreditCard |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 5000.000000 | 5000.000000 | 5000.000000 | 5000.000000 | 5000.000000 | 5000.000000 | 5000.000000 | 5000.000000 | 5000.000000 | 5000.000000 | 5000.000000 | 5000.00000 | 5000.000000 | 5000.000000 |
| mean | 2500.500000 | 45.338400 | 20.104600 | 73.774200 | 93169.257000 | 2.396400 | 1.937938 | 1.881000 | 56.498800 | 0.096000 | 0.104400 | 0.06040 | 0.596800 | 0.294000 |
| std | 1443.520003 | 11.463166 | 11.467954 | 46.033729 | 1759.455086 | 1.147663 | 1.747659 | 0.839869 | 101.713802 | 0.294621 | 0.305809 | 0.23825 | 0.490589 | 0.455637 |
| min | 1.000000 | 23.000000 | -3.000000 | 8.000000 | 90005.000000 | 1.000000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.00000 | 0.000000 | 0.000000 |
| 25% | 1250.750000 | 35.000000 | 10.000000 | 39.000000 | 91911.000000 | 1.000000 | 0.700000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.00000 | 0.000000 | 0.000000 |
| 50% | 2500.500000 | 45.000000 | 20.000000 | 64.000000 | 93437.000000 | 2.000000 | 1.500000 | 2.000000 | 0.000000 | 0.000000 | 0.000000 | 0.00000 | 1.000000 | 0.000000 |
| 75% | 3750.250000 | 55.000000 | 30.000000 | 98.000000 | 94608.000000 | 3.000000 | 2.500000 | 3.000000 | 101.000000 | 0.000000 | 0.000000 | 0.00000 | 1.000000 | 1.000000 |
| max | 5000.000000 | 67.000000 | 43.000000 | 224.000000 | 96651.000000 | 4.000000 | 10.000000 | 3.000000 | 635.000000 | 1.000000 | 1.000000 | 1.00000 | 1.000000 | 1.000000 |
# Print the number of rows and columns in the data set.
print("The data set has " + str(Loan_Data.shape[0]) + " rows and " + str(Loan_Data.shape[1]) + " columns.")
The data set has 5000 rows and 14 columns.
# Get top 5 rows
Loan_Data.head(5)
| | ID | Age | Experience | Income | ZIPCode | Family | CCAvg | Education | Mortgage | Personal_Loan | Securities_Account | CD_Account | Online | CreditCard |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 25 | 1 | 49 | 91107 | 4 | 1.6 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
| 1 | 2 | 45 | 19 | 34 | 90089 | 3 | 1.5 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
| 2 | 3 | 39 | 15 | 11 | 94720 | 1 | 1.0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 4 | 35 | 9 | 100 | 94112 | 1 | 2.7 | 2 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 5 | 35 | 8 | 45 | 91330 | 4 | 1.0 | 2 | 0 | 0 | 0 | 0 | 0 | 1 |
# Identify if we have any missing data
Loan_Data.isna().sum()
ID                    0
Age                   0
Experience            0
Income                0
ZIPCode               0
Family                0
CCAvg                 0
Education             0
Mortgage              0
Personal_Loan         0
Securities_Account    0
CD_Account            0
Online                0
CreditCard            0
dtype: int64
# Drop the ID column before we start the next analysis
Loan_Data.drop(columns=['ID'], inplace=True)
# Create the ZIP code search engine
search = SearchEngine()
cityArray = []
stateArray = []
countyArray = []

# Each helper uses its own argument z (the original read a global named zip)
def getCity(z):
    return search.by_zipcode(z).city

def getCounty(z):
    return search.by_zipcode(z).county

def getState(z):
    return search.by_zipcode(z).state

# Loop through the ZIP codes and set city, county and state;
# any lookup that raises is recorded as NaN
for i in range(len(Loan_Data)):
    zip_code = Loan_Data["ZIPCode"].iloc[i]
    try:
        cityArray.append(getCity(zip_code))
    except Exception:
        cityArray.append(np.nan)
    try:
        countyArray.append(getCounty(zip_code))
    except Exception:
        countyArray.append(np.nan)
    try:
        stateArray.append(getState(zip_code))
    except Exception:
        stateArray.append(np.nan)
Loan_Data['City'] = cityArray
Loan_Data['County'] = countyArray
Loan_Data['State'] = stateArray
Loan_Data.head(10)
| | Age | Experience | Income | ZIPCode | Family | CCAvg | Education | Mortgage | Personal_Loan | Securities_Account | CD_Account | Online | CreditCard | City | County | State |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 25 | 1 | 49 | 91107 | 4 | 1.6 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | Pasadena | Los Angeles County | CA |
| 1 | 45 | 19 | 34 | 90089 | 3 | 1.5 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | Los Angeles | Los Angeles County | CA |
| 2 | 39 | 15 | 11 | 94720 | 1 | 1.0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | Berkeley | Alameda County | CA |
| 3 | 35 | 9 | 100 | 94112 | 1 | 2.7 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | San Francisco | San Francisco County | CA |
| 4 | 35 | 8 | 45 | 91330 | 4 | 1.0 | 2 | 0 | 0 | 0 | 0 | 0 | 1 | Northridge | Los Angeles County | CA |
| 5 | 37 | 13 | 29 | 92121 | 4 | 0.4 | 2 | 155 | 0 | 0 | 0 | 1 | 0 | San Diego | San Diego County | CA |
| 6 | 53 | 27 | 72 | 91711 | 2 | 1.5 | 2 | 0 | 0 | 0 | 0 | 1 | 0 | Claremont | Los Angeles County | CA |
| 7 | 50 | 24 | 22 | 93943 | 1 | 0.3 | 3 | 0 | 0 | 0 | 0 | 0 | 1 | Monterey | Monterey County | CA |
| 8 | 35 | 10 | 81 | 90089 | 3 | 0.6 | 2 | 104 | 0 | 0 | 0 | 1 | 0 | Los Angeles | Los Angeles County | CA |
| 9 | 34 | 9 | 180 | 93023 | 1 | 8.9 | 3 | 0 | 1 | 0 | 0 | 0 | 0 | Ojai | Ventura County | CA |
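The lookup loop calls the search engine three times for every row, although many rows share the same ZIP code. A minimal sketch of resolving each distinct code once and joining the result back is shown here. `lookup_zip` and its two-entry table are hypothetical stand-ins for `search.by_zipcode`, used only so the snippet runs without the uszipcode database.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for search.by_zipcode(); this is NOT the uszipcode API.
_FAKE_DB = {
    91107: ("Pasadena", "Los Angeles County", "CA"),
    94720: ("Berkeley", "Alameda County", "CA"),
}


def lookup_zip(zip_code):
    """Return (city, county, state), or NaNs when the code is unknown."""
    return _FAKE_DB.get(int(zip_code), (np.nan, np.nan, np.nan))


demo = pd.DataFrame({"ZIPCode": [91107, 94720, 91107, 92717]})

# Resolve every distinct ZIP code exactly once.
unique_zips = demo["ZIPCode"].unique()
lookup = pd.DataFrame(
    [lookup_zip(z) for z in unique_zips],
    columns=["City", "County", "State"],
    index=unique_zips,
)

# Join the per-ZIP results back onto the rows via the ZIPCode column.
demo = demo.join(lookup, on="ZIPCode")
print(demo)
```

With thousands of rows and only a few hundred distinct ZIP codes, this pattern would cut the number of lookups considerably while producing the same three columns.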
Loan_Data['City'].unique()
array(['Pasadena', 'Los Angeles', 'Berkeley', 'San Francisco',
'Northridge', 'San Diego', 'Claremont', 'Monterey', 'Ojai',
'Redondo Beach', 'Santa Barbara', 'Belvedere Tiburon', 'Glendora',
'Santa Clara', 'Capitola', 'Stanford', 'Studio City', 'Daly City',
'Newbury Park', 'Arcata', 'Santa Cruz', 'Fremont', 'Richmond',
'Mountain View', 'Huntington Beach', 'Sacramento', 'San Clemente',
'Davis', 'Redwood City', 'Cupertino', 'Santa Clarita', 'Roseville',
'Redlands', 'La Jolla', 'Brisbane', 'El Segundo', 'Los Altos',
'Santa Monica', 'San Luis Obispo', 'Pleasant Hill',
'Thousand Oaks', 'Rancho Cordova', 'San Jose', 'Reseda', 'Salinas',
'Cardiff By The Sea', 'Oakland', 'San Rafael', 'Banning',
'Bakersfield', 'Riverside', 'Rancho Cucamonga', 'Alameda',
'Palo Alto', 'Livermore', 'Irvine', 'South San Francisco',
'Emeryville', 'Ridgecrest', nan, 'Hayward', 'San Gabriel',
'Santa Ana', 'Loma Linda', 'Encinitas', 'Fullerton',
'Agoura Hills', 'San Marcos', 'Fresno', 'Long Beach', 'Milpitas',
'Camarillo', 'Rohnert Park', 'Rosemead', 'Sherman Oaks', 'Seaside',
'Goleta', 'Walnut Creek', 'Menlo Park', 'Albany', 'Torrance',
'Hawthorne', 'Eureka', 'La Mesa', 'Edwards', 'San Ysidro',
'San Leandro', 'Mission Hills', 'Valencia', 'South Lake Tahoe',
'Porter Ranch', 'Venice', 'Anaheim', 'Sunnyvale', 'Laguna Niguel',
'Costa Mesa', 'San Ramon', 'Mission Viejo', 'San Bernardino',
'Belmont', 'Moss Landing', 'Bodega Bay', 'Hollister', 'San Pablo',
'La Palma', 'Garden Grove', 'West Sacramento', 'Seal Beach',
'Glendale', 'Chico', 'Lompoc', 'Cypress', 'Manhattan Beach',
'Folsom', 'Sanger', 'Canoga Park', 'Carson', 'Hermosa Beach',
'Vallejo', 'Fallbrook', 'Oceanside', 'Escondido', 'Highland',
'San Mateo', 'Greenbrae', 'Ukiah', 'Chino Hills', 'Chatsworth',
'Antioch', 'Orange', 'Hacienda Heights', 'Fawnskin', 'Novato',
'Pleasanton', 'Baldwin Park', 'San Luis Rey', 'Sylmar',
'Culver City', 'Arcadia', 'Pomona', 'Carlsbad', 'Montebello',
'Tustin', 'March Air Reserve Base', 'Carpinteria', 'Stockton',
'Lomita', 'Fairfield', 'Burlingame', 'Beverly Hills', 'Gilroy',
'Placentia', 'Concord', 'San Juan Bautista', 'Laguna Hills',
'Brea', 'Chula Vista', 'San Anselmo', 'Bonita', 'Citrus Heights',
'Ventura', 'Tehachapi', 'Imperial', 'Monterey Park', 'Montague',
'South Pasadena', 'Santa Rosa', 'Monrovia', 'Merced',
'National City', 'Simi Valley', 'Sunland', 'Newport Beach',
'Elk Grove', 'Trinity Center', 'San Bruno', 'Larkspur',
'El Dorado Hills', 'Poway', 'Calabasas', 'Crestline', 'La Mirada',
'Clovis', 'North Hollywood', 'San Juan Capistrano', 'Norwalk',
'Yorba Linda', 'Campbell', 'Los Alamitos', 'Aptos',
'Woodland Hills', 'Montclair', 'Westlake Village', 'Modesto',
'Castro Valley', 'Yucaipa', 'Palos Verdes Peninsula', 'Los Gatos',
'Half Moon Bay', 'Oxnard', 'Oak View', 'North Hills',
'El Sobrante', 'Martinez', 'Inglewood', 'Vista', 'Whittier',
'Rio Vista', 'Saratoga', 'Morgan Hill', 'Portola Valley',
'Redding', 'Sierra Madre', 'Sonora', 'Danville', 'Bella Vista',
'Boulder Creek', 'Lake Forest', 'Ceres', 'Alhambra', 'Chino',
'Pacific Grove', 'Napa', 'Marina', 'Alamo', 'Moraga', 'Hopland',
'Santa Ynez', 'Ben Lomond', 'Van Nuys', 'Capistrano Beach',
'Sausalito', 'Upland', 'Diamond Bar', 'South Gate', 'Clearlake',
'Ladera Ranch', 'Rancho Palos Verdes', 'Pacific Palisades',
'West Covina', 'San Dimas', 'Signal Hill', 'Tahoe City', 'Weed',
'Stinson Beach'], dtype=object)
Loan_Data['County'].unique()
array(['Los Angeles County', 'Alameda County', 'San Francisco County',
'San Diego County', 'Monterey County', 'Ventura County',
'Santa Barbara County', 'Marin County', 'Santa Clara County',
'Santa Cruz County', 'San Mateo County', 'Humboldt County',
'Contra Costa County', 'Orange County', 'Sacramento County',
'Yolo County', 'Placer County', 'San Bernardino County',
'San Luis Obispo County', 'Riverside County', 'Kern County', nan,
'Fresno County', 'Sonoma County', 'El Dorado County',
'San Benito County', 'Butte County', 'Solano County',
'Mendocino County', 'San Joaquin County', 'Imperial County',
'Siskiyou County', 'Merced County', 'Trinity County',
'Stanislaus County', 'Shasta County', 'Tuolumne County',
'Napa County', 'Lake County'], dtype=object)
Loan_Data.isna().sum()
Age                    0
Experience             0
Income                 0
ZIPCode                0
Family                 0
CCAvg                  0
Education              0
Mortgage               0
Personal_Loan          0
Securities_Account     0
CD_Account             0
Online                 0
CreditCard             0
City                  34
County                34
State                 34
dtype: int64
Observation: The lookup did not resolve every ZIP code, so 34 rows have missing City, County and State values that need to be handled.
# Identify the missing zip codes
missingZips = Loan_Data[Loan_Data['State'].isnull()]
# Look at the row counts for each unmatched ZIP code
missingZips['ZIPCode'].value_counts()
92717    22
96651     6
92634     5
93077     1
Name: ZIPCode, dtype: int64
# Manually set state, city and county for the ZIP codes that did not match
# (the city and county values are the manual assignments chosen for this analysis)
manual_fixes = {
    92717: ("San Francisco", "San Francisco County"),
    96651: ("San Francisco", "San Francisco County"),
    92634: ("Los Angeles", "Los Angeles County"),
    93077: ("San Francisco", "San Francisco County"),
}
# .loc with a boolean mask avoids the chained assignment of ["col"].iloc[i] = value
for zip_code, (city, county) in manual_fixes.items():
    mask = Loan_Data["ZIPCode"] == zip_code
    Loan_Data.loc[mask, ["State", "City", "County"]] = ["CA", city, county]
# Check for missing values one more time
Loan_Data.isna().sum()
Age                   0
Experience            0
Income                0
ZIPCode               0
Family                0
CCAvg                 0
Education             0
Mortgage              0
Personal_Loan         0
Securities_Account    0
CD_Account            0
Online                0
CreditCard            0
City                  0
County                0
State                 0
dtype: int64
All missing data dealt with.
Questions:
# Setup reusable plot functions
# function to plot a boxplot and a histogram along the same scale.
def histogram_boxplot(data, feature, yLabel, xLabel, title, figsize=(12, 7), kde=False, bins=None):
    """
    Boxplot and histogram combined
    data: dataframe
    feature: dataframe column
    yLabel: y-axis label of the histogram
    xLabel: x-axis label of the histogram
    title: title shown above the boxplot
    figsize: size of figure (default (12,7))
    kde: whether to show the density curve (default False)
    bins: number of bins for histogram (default None)
    """
    f2, (ax_box2, ax_hist2) = plt.subplots(
        nrows=2,  # Number of rows of the subplot grid = 2
        sharex=True,  # x-axis will be shared among all subplots
        gridspec_kw={"height_ratios": (0.25, 0.75)},
        figsize=figsize,
    )  # creating the 2 subplots
    bp = sns.boxplot(
        data=data, x=feature, ax=ax_box2, showmeans=True, color="violet"
    )  # boxplot will be created and a marker will indicate the mean value of the column
    hp = sns.histplot(
        data=data, x=feature, kde=kde, ax=ax_hist2, bins=bins, palette="winter",
    ) if bins else sns.histplot(
        data=data, x=feature, kde=kde, ax=ax_hist2
    )  # For histogram
    ax_hist2.axvline(
        data[feature].mean(), color="green", linestyle="--"
    )  # Add mean to the histogram
    ax_hist2.axvline(
        data[feature].median(), color="black", linestyle="-"
    )  # Add median to the histogram
    bp.set(title=title)
    hp.set(ylabel=yLabel, xlabel=xLabel)
# function to create labeled barplots
def labeled_barplot(data, feature, yLabel, xLabel, title, perc=False, n=None):
    """
    Barplot with percentage at the top
    data: dataframe
    feature: dataframe column
    perc: whether to display percentages instead of count (default is False)
    n: displays the top n category levels (default is None, i.e., display all levels)
    """
    total = len(data[feature])  # length of the column
    count = data[feature].nunique()
    if n is None:
        plt.figure(figsize=(count + 1, 5))
    else:
        plt.figure(figsize=(n + 1, 5))
    plt.xticks(rotation=90, fontsize=15)
    ax = sns.countplot(
        data=data,
        x=feature,
        palette="Paired",
        order=data[feature].value_counts().index[:n].sort_values(),
    )
    ax.set(ylabel=yLabel, xlabel=xLabel, title=title)
    for p in ax.patches:
        if perc:
            label = "{:.1f}%".format(
                100 * p.get_height() / total
            )  # percentage of each class of the category
        else:
            label = p.get_height()  # count of each level of the category
        x = p.get_x() + p.get_width() / 2  # horizontal centre of the bar
        y = p.get_height()  # height of the bar
        ax.annotate(
            label,
            (x, y),
            ha="center",
            va="center",
            size=12,
            xytext=(0, 5),
            textcoords="offset points",
        )  # annotate the count or percentage
    plt.show()  # show the plot
# Standard histplot that takes params for labeling
def hist_plot(x, yLabel, xLabel, title):
    plt.figure(figsize=(15, 7))
    # Create seaborn histogram
    hp = sns.histplot(data=Loan_Data, x=x)
    # specify axis labels
    hp.set(ylabel=yLabel, xlabel=xLabel, title=title)
    plt.show()

# boxplot with hue, takes params for labeling
def box_plot(x, y, yLabel, xLabel, title, hue):
    plt.figure(figsize=(15, 7))
    # Create seaborn box plot
    hp = sns.boxplot(x=x, y=y, data=Loan_Data, hue=hue, dodge=False)
    # specify axis labels
    hp.set(ylabel=yLabel, xlabel=xLabel, title=title)
    plt.show()

# standard countplot without hue, takes params for labeling
def count_plot(x, yLabel, xLabel, title):
    plt.figure(figsize=(15, 7))
    # Create seaborn countplot
    cp = sns.countplot(data=Loan_Data, x=x)
    # specify axis labels
    cp.set(ylabel=yLabel, xlabel=xLabel, title=title)
    plt.show()

# countplot with hue and dodge=False (bars overlap), takes params for labeling
def count_plot2(x, yLabel, xLabel, title, hue):
    plt.figure(figsize=(10, 5))
    cp = sns.countplot(data=Loan_Data, x=x, hue=hue, dodge=False)
    cp.set(ylabel=yLabel, xlabel=xLabel, title=title)
    plt.show()

# countplot with hue and the default dodge (bars side by side)
def count_plot3(x, yLabel, xLabel, title, hue):
    plt.figure(figsize=(10, 5))
    cp = sns.countplot(data=Loan_Data, x=x, hue=hue)
    cp.set(ylabel=yLabel, xlabel=xLabel, title=title)
    plt.show()
# Count plot to see how many customers accepted a personal loan versus those who did not
count_plot(Loan_Data['Personal_Loan'],'Count of customers', 'Personal Loan (1 = has one, 0 = does not have one)','Got a personal loan after the last campaign?')
# Count plot to see the credit card distribution
count_plot2(Loan_Data['CreditCard'],'Count of customers', 'Card usage (1 = uses competitor card, 0 = does not use competitor card)','Customer credit card (use competitor card?)','CreditCard')
# Mortgage plots
histogram_boxplot(Loan_Data, 'Mortgage','Count of customers','Mortgage amount (thousands of dollars)','Mortgage Amounts by Customer')
# Age plots
histogram_boxplot(Loan_Data, 'Age','Count of customers','Age (years)','Ages of the customers')
# Experience plots
histogram_boxplot(Loan_Data, 'Experience','Count of customers','Experience (years)','Experience levels of the customers')
# Income plots
histogram_boxplot(Loan_Data, 'Income','Count of customers','Income (thousands of dollars)','Income level')
# Family plots
labeled_barplot(Loan_Data, 'Family','Count of customer','Family Size (people)','Family')
# Plot Family / PERSONAL LOAN
# box_plot creates and shows its own figure, so no extra plt.figure() or plt.show() is needed
box_plot("Personal_Loan","Family","Family size of customer","Personal Loan (1 = yes, 0 = no)","Family size / personal loan status","Personal_Loan")
# Plot AGE / PERSONAL LOAN
# box_plot creates and shows its own figure, so no extra plt.figure() or plt.show() is needed
box_plot("Personal_Loan","Age","Age of customer","Personal Loan (1 = yes, 0 = no)","Age / personal loan status","Personal_Loan")
# Plot CCAVG / PERSONAL LOAN
# box_plot creates and shows its own figure, so no extra plt.figure() or plt.show() is needed
box_plot("Personal_Loan","CCAvg","Average credit card spend per month (thousands of dollars)","Personal Loan (1 = yes, 0 = no)","CCAvg / personal loan status","Personal_Loan")
# Facet plot of average credit card spend (CCAvg) by personal loan
p = so.Plot(Loan_Data, x="CCAvg")
p.facet("Personal_Loan").add(so.Area(), so.KDE(common_norm=["col"])).label(col="Personal Loan:")
# Facet plot of income by personal loan
p = so.Plot(Loan_Data, x="Income")
p.facet("Personal_Loan").add(so.Area(), so.KDE(common_norm=["col"])).label(col="Personal Loan:")
# Plot INCOME / PERSONAL LOAN
# box_plot creates and shows its own figure, so no extra plt.figure() or plt.show() is needed
box_plot("Personal_Loan","Income","Income of customer (thousands of dollars)","Personal Loan (1 = yes, 0 = no)","Income / personal loan status","Personal_Loan")
# Plot Mortgage / PERSONAL LOAN
# box_plot creates and shows its own figure, so no extra plt.figure() or plt.show() is needed
box_plot("Personal_Loan","Mortgage","Mortgage of customer (thousands of dollars)","Personal Loan (1 = yes, 0 = no)","Mortgage / personal loan status","Personal_Loan")
# Plot Security account/ PERSONAL LOAN
count_plot2("Personal_Loan","Securities account customer count","Personal Loan (1 = yes, 0 = no)","Securities account/ personal loan status","Securities_Account")
# Plot CD Account/ PERSONAL LOAN
count_plot2("Personal_Loan","CD Account customer count","Personal Loan (1 = yes, 0 = no)","CD Account / personal loan status","CD_Account")
# Plot Online Account/ PERSONAL LOAN
count_plot3("Personal_Loan","Online account customer count","Personal Loan (1 = yes, 0 = no)","Online / personal loan status","Online")
# Plot Credit Card Account/ PERSONAL LOAN
count_plot2("Personal_Loan","Credit Card customer count","Personal Loan (1 = yes, 0 = no)","Credit Card / personal loan status","CreditCard")
# Plot education, loan and income
# catplot is figure-level and creates its own figure, so no plt.figure() call is needed
g = sns.catplot(
    data=Loan_Data,
    x="Income", y="Education", row="Personal_Loan",
    kind="box", orient="h",
    sharex=False, margin_titles=True,
    height=1.5, aspect=7,
)
g.set(xlabel="Income level in thousands of dollars", ylabel="Education level")
g.set_titles(row_template="Personal Loan = {row_name}")
for ax in g.axes.flat:
    ax.xaxis.set_major_formatter('${x:.0f}')
plt.show()
# Plot Education / PERSONAL LOAN
# box_plot creates and shows its own figure, so no extra plt.figure() or plt.show() is needed
box_plot("Personal_Loan","Education","Education level 1=Undergrad 2=Grad 3=Advanced","Personal Loan (1 = yes, 0 = no)","Education / personal loan status","Personal_Loan")
# CCavg plots
histogram_boxplot(Loan_Data, 'CCAvg','Count of customers','Average credit card spend per month (thousands of dollars)','Credit Card Avg')
# Education plots
histogram_boxplot(Loan_Data, 'Education','Count of customer','Education Level. 1: Undergrad; 2: Graduate;3: Advanced/Professional','Education')
# Plot average credit card spend and family size together
# box_plot creates and shows its own figure, so no extra plt.figure() or plt.show() is needed
box_plot("Family","CCAvg","Credit card spend / month in thousands","Family members in household","Credit card spend / family size", hue="Family")
# Education vs card spend plotting
plt.figure(figsize=(15, 7))
bp = sns.boxplot(x="Education", y="CCAvg", data=Loan_Data)
bp.set(xlabel="Education Level. 1: Undergrad; 2: Graduate;3: Advanced/Professional", ylabel="Credit card average spend / month", title="Education vs credit card spend")
plt.show()
### Income / Counties
plt.figure(figsize=(30,50))
sns.boxplot(x="Income", y="County", data=Loan_Data)
plt.show()
### Scatter plot for income and credit card
plt.figure(figsize=(30,10))
sns.scatterplot(x="CCAvg", y="Income", data=Loan_Data)
plt.show()
# print so I can reuse these in the correlation heat map
print(Loan_Data.columns)
Index(['Age', 'Experience', 'Income', 'ZIPCode', 'Family', 'CCAvg',
'Education', 'Mortgage', 'Personal_Loan', 'Securities_Account',
'CD_Account', 'Online', 'CreditCard', 'City', 'County', 'State'],
dtype='object')
# Define the column list to use in the heatmap
# (City, County and State are text columns, so they are left out of the correlation)
columns= ['Income','Family', 'CCAvg','Education','Age','Mortgage', 'Personal_Loan', 'Experience','Securities_Account','CD_Account', 'Online', 'CreditCard']
# Create a heatmap of the correlated columns
plt.figure(figsize=(15, 7))
sns.heatmap(
Loan_Data[columns].corr(), annot=True, vmin=-1, vmax=1, fmt=".2f", cmap="Spectral"
)
plt.show()
# Drop the ZIP column before we start the next analysis
Loan_Data.drop(columns=['ZIPCode'], inplace=True)
# Drop Experience: it is so strongly correlated with Age that it adds no value
Loan_Data.drop(columns=['Experience'], inplace=True)
# Drop State: every customer is in California
Loan_Data.drop(columns=['State'], inplace=True)
# Detect and plot outliers in each column
numerical_col = Loan_Data.select_dtypes(include=np.number).columns.tolist()
plt.figure(figsize=(20, 30))
for i, variable in enumerate(numerical_col):
    plt.subplot(5, 4, i + 1)
    plt.boxplot(Loan_Data[variable], whis=1.5)
    plt.title(variable)
plt.show()
# functions to treat outliers by flooring and capping
def treat_outliers(df, col):
    """
    Treats outliers in a variable
    df: dataframe
    col: dataframe column
    """
    Q1 = df[col].quantile(0.25)  # 25th quantile
    Q3 = df[col].quantile(0.75)  # 75th quantile
    IQR = Q3 - Q1
    Lower_Whisker = Q1 - 1.5 * IQR
    Upper_Whisker = Q3 + 1.5 * IQR
    # all the values smaller than Lower_Whisker will be assigned the value of Lower_Whisker
    # all the values greater than Upper_Whisker will be assigned the value of Upper_Whisker
    df[col] = np.clip(df[col], Lower_Whisker, Upper_Whisker)
    return df

def treat_outliers_all(df, col_list):
    # Treat outliers in a list of variables
    for c in col_list:
        df = treat_outliers(df, c)
    return df
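To make the flooring and capping rule concrete, here is a small worked example on five made-up values (not taken from the loan data). It applies the same 1.5 × IQR whisker arithmetic as `treat_outliers`, using NumPy directly.

```python
import numpy as np

# Illustrative values only; 100 is the obvious outlier.
values = np.array([1.0, 2.0, 3.0, 4.0, 100.0])

q1, q3 = np.quantile(values, [0.25, 0.75])  # 2.0 and 4.0
iqr = q3 - q1                               # 2.0
lower = q1 - 1.5 * iqr                      # -1.0
upper = q3 + 1.5 * iqr                      # 7.0

# Values outside [lower, upper] are pulled back to the nearest whisker.
capped = np.clip(values, lower, upper)
print(lower, upper, capped)
```

The outlier is capped at the upper whisker (7.0) rather than dropped, so the number of rows stays the same.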
# Build a dataframe without Personal_Loan, Securities_Account and CD_Account before treating outliers
# (capping these 0/1 columns would clip every 1 to 0, and the model would no longer work)
treatOut = Loan_Data.drop(columns=['Personal_Loan','Securities_Account','CD_Account'])
# Treat all outliers by capping them at the whiskers (values are clipped, not removed)
numerical_col = treatOut.select_dtypes(include=np.number).columns.tolist()
# Note: the treated values are held in `data`; Loan_Data itself is left unchanged
data = treat_outliers_all(treatOut, numerical_col)
# look at the box plots to see whether the outliers have been treated
plt.figure(figsize=(20, 30))
for i, variable in enumerate(numerical_col):
    plt.subplot(5, 4, i + 1)
    plt.boxplot(data[variable], whis=1.5)
    plt.title(variable)
plt.show()
Evaluate:
Under performance means:
Our most important predictor?
What will we do to improve success?
Creating training and test sets.
# Rechecking the state of the dataframe to ensure all columns are as expected
Loan_Data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 13 columns):
 #   Column              Non-Null Count  Dtype
---  ------              --------------  -----
 0   Age                 5000 non-null   int64
 1   Income              5000 non-null   int64
 2   Family              5000 non-null   int64
 3   CCAvg               5000 non-null   float64
 4   Education           5000 non-null   int64
 5   Mortgage            5000 non-null   int64
 6   Personal_Loan       5000 non-null   int64
 7   Securities_Account  5000 non-null   int64
 8   CD_Account          5000 non-null   int64
 9   Online              5000 non-null   int64
 10  CreditCard          5000 non-null   int64
 11  City                5000 non-null   object
 12  County              5000 non-null   object
dtypes: float64(1), int64(10), object(2)
memory usage: 507.9+ KB
# specifying the independent X and dependent Y variables
X = Loan_Data.drop(["Personal_Loan"], axis=1)
Y = Loan_Data["Personal_Loan"]
# adding a constant to the independent variables
X = sm.add_constant(X)
# creating dummy variables
X = pd.get_dummies(X, drop_first=True)
# splitting data in train and test sets
X_train, X_test, y_train, y_test = train_test_split(
X, Y, test_size=0.30, random_state=1
)
# Print some information to aid in understanding the training and test sets
print("Shape of Training set : ", X_train.shape)
print("Shape of test set : ", X_test.shape)
print("")
print("Percentage of classes in training set:")
print(y_train.value_counts(normalize=True))
print("")
print("Percentage of classes in test set:")
print(y_test.value_counts(normalize=True))
print("")
print("{0:0.2f}% data is in training set".format((len(X_train)/len(Loan_Data.index)) * 100))
print("{0:0.2f}% data is in test set".format((len(X_test)/len(Loan_Data.index)) * 100))
Shape of Training set :  (3500, 291)
Shape of test set :  (1500, 291)

Percentage of classes in training set:
0    0.905429
1    0.094571
Name: Personal_Loan, dtype: float64

Percentage of classes in test set:
0    0.900667
1    0.099333
Name: Personal_Loan, dtype: float64

70.00% data is in training set
30.00% data is in test set
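The bulk of the 291 columns reported above come from `pd.get_dummies`, which turns every distinct City and County value into its own indicator column. A minimal sketch on a made-up four-row frame shows the mechanism. The toy values are illustrative only.

```python
import pandas as pd

# Toy frame: one numeric column and one text column with three distinct values.
toy = pd.DataFrame(
    {
        "Income": [49, 34, 11, 100],
        "City": ["Pasadena", "Los Angeles", "Berkeley", "Pasadena"],
    }
)

# drop_first=True drops the first level (alphabetically "Berkeley") to avoid
# a redundant column; the remaining levels each get an indicator column.
encoded = pd.get_dummies(toy, drop_first=True)
print(encoded.columns.tolist())
print(encoded.shape)
```

A text column with k distinct values contributes k - 1 columns, which is why high-cardinality fields such as City can dominate the width of the design matrix.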
# Plots a confusion matrix so we can compare the effect of various thresholds
# threshold has a default value of 0.5
def confusionMatrixWithThreshold(model, predictors, target, threshold=0.5):
    # Predicted probability of class 1, converted to 0/1 labels at the threshold
    pred_prob = model.predict_proba(predictors)[:, 1]
    y_pred = (pred_prob > threshold).astype(int)
    cm = confusion_matrix(target, y_pred)
    # Cell labels: count and share of all observations
    labels = np.asarray(
        [
            ["{0:0.0f}".format(item) + "\n{0:.2%}".format(item / cm.flatten().sum())]
            for item in cm.flatten()
        ]
    ).reshape(2, 2)
    # Plot heatmap
    plt.figure(figsize=(6, 4))
    sns.heatmap(cm, annot=labels, fmt="")
    plt.ylabel("True label")
    plt.xlabel("Predicted label")
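How the threshold changes the confusion matrix can be seen on a handful of made-up probabilities. This is a minimal sketch with illustrative labels and scores, not model output. `confusion_counts` is a hypothetical helper that mirrors the thresholding step of the function above without the plotting.

```python
import numpy as np

# Illustrative true labels and predicted probabilities of class 1.
y_true = np.array([0, 0, 0, 1, 1, 0, 1, 0])
probs = np.array([0.05, 0.20, 0.60, 0.15, 0.80, 0.08, 0.45, 0.30])


def confusion_counts(y_true, probs, threshold=0.5):
    """Return (tn, fp, fn, tp) after labelling probs > threshold as class 1."""
    pred = (probs > threshold).astype(int)
    tn = int(np.sum((pred == 0) & (y_true == 0)))
    fp = int(np.sum((pred == 1) & (y_true == 0)))
    fn = int(np.sum((pred == 0) & (y_true == 1)))
    tp = int(np.sum((pred == 1) & (y_true == 1)))
    return tn, fp, fn, tp


print(confusion_counts(y_true, probs, 0.5))  # (4, 1, 2, 1)
print(confusion_counts(y_true, probs, 0.1))  # (2, 3, 0, 3)
```

Lowering the threshold from 0.5 to 0.1 removes both false negatives in this toy case but adds two false positives, which is the trade-off the threshold search in this notebook explores.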
# we will use this to compute all performance metrics and test various thresholds
# threshold has a default value of 0.5
def check_model_performance_classification_with_threshold(
    model, predictors, target, threshold=0.5
):
    # predicting using the independent variables
    pred_prob = model.predict_proba(predictors)[:, 1]
    pred = (pred_prob > threshold).astype(int)
    # compute accuracy, recall, precision and F1-score
    acc = accuracy_score(target, pred)
    recall = recall_score(target, pred)
    precision = precision_score(target, pred)
    f1 = f1_score(target, pred)
    # creating a dataframe of metrics
    df_perf = pd.DataFrame(
        {"Accuracy": acc, "Recall": recall, "Precision": precision, "F1": f1},
        index=[0],
    )
    return df_perf
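For reference, the four metrics returned above follow directly from the confusion-matrix counts. The sketch below works them out by hand for a set of made-up counts (not the counts of this model). `metrics_from_counts` is a hypothetical helper name.

```python
def metrics_from_counts(tn, fp, fn, tp):
    """Compute accuracy, recall, precision and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    recall = tp / (tp + fn)        # share of actual buyers that were found
    precision = tp / (tp + fp)     # share of predicted buyers that really bought
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, recall, precision, f1


# Illustrative counts for 1,000 customers, 100 of whom bought the loan.
accuracy, recall, precision, f1 = metrics_from_counts(tn=885, fp=15, fn=40, tp=60)
print(accuracy, recall, precision, round(f1, 4))
```

In this toy case accuracy is 0.945 even though recall is only 0.6, which illustrates why recall and precision are reported alongside accuracy when the positive class is rare.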
# Building the logistic regression model
lg = LogisticRegression(random_state=1)
#Fitting the training data
model = lg.fit(X_train, y_train)
# Accuracy of the initial model on the test set
model_score = model.score(X_test, y_test)
print("The score indicates that " + str(round(model_score*100)) + "% of the time predictions are correct and " + str(round(100-(model_score*100))) + "% of the predictions are incorrect.")
The score indicates that 94% of the time predictions are correct and 6% of the predictions are incorrect.
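An accuracy of 94% needs a point of comparison, because roughly 90% of the customers did not take the loan. The short calculation below uses two figures printed in this notebook: the class-0 share of the test set and the test accuracy from the performance table. It compares the model with a trivial rule that always predicts "no loan".

```python
# Figures as printed in this notebook's outputs.
majority_share = 0.900667   # share of class 0 in the test set
model_accuracy = 0.943333   # test accuracy of the logistic regression

# A rule that always predicts 0 is right exactly as often as class 0 occurs.
baseline_accuracy = majority_share
gain = model_accuracy - baseline_accuracy

print(f"baseline {baseline_accuracy:.1%}, model {model_accuracy:.1%}, gain {gain:.1%}")
```

The model improves on the trivial baseline by only about four percentage points of accuracy, so recall and precision on the loan-taking class are the more informative measures here.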
# Find the roc auc score for training data
logit_roc_auc_train = roc_auc_score(
y_train, lg.predict_proba(X_train)[:, 1]
) # The indexing represents predicted probabilities for class 1
fpr, tpr, thresholds = roc_curve(y_train, lg.predict_proba(X_train)[:, 1])
plt.figure(figsize=(7, 5))
# Plot roc curve
plt.plot(fpr, tpr, label="Logistic Regression (area = %0.2f)" % logit_roc_auc_train)
plt.plot([0, 1], [0, 1], "b--")
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
# calculate the g-mean for each threshold
gmeans = sqrt(tpr * (1-fpr))
# locate the index of the largest g-mean
ix = argmax(gmeans)
# Overlay the ROC curve with point markers
plt.plot(fpr, tpr, marker='.', label='Logistic')
# Mark the point with the largest g-mean
plt.scatter(fpr[ix], tpr[ix], marker='o', color='black', label='Best')
# Label the plot
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("Receiver operating characteristic")
plt.legend(loc="lower right")
plt.show()
# Find the roc auc score for test data
logit_roc_auc_test = roc_auc_score(
y_test, lg.predict_proba(X_test)[:, 1]
)
# Find fpr, tpr and threshold values
fpr, tpr, thresholds = roc_curve(y_test, lg.predict_proba(X_test)[:, 1])
plt.figure(figsize=(7, 5))
# Plot roc curve
plt.plot(fpr, tpr, label="Logistic Regression (area = %0.2f)" % logit_roc_auc_test)
plt.plot([0, 1], [0, 1], "b--")
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
# calculate the g-mean for each threshold
gmeans = sqrt(tpr * (1-fpr))
# locate the index of the largest g-mean
ix = argmax(gmeans)
plt.plot(fpr, tpr, marker='.', label='Logistic')
plt.scatter(fpr[ix], tpr[ix], marker='o', color='black', label='Best')
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("Receiver operating characteristic")
plt.legend(loc="lower right")
plt.show()
# Optimal threshold for AUC-ROC curve
fpr, tpr, thresholds = roc_curve(y_train, lg.predict_proba(X_train)[:, 1])
optimal_idx = np.argmax(tpr - fpr)
optimal_threshold_auc_roc = thresholds[optimal_idx]
print("The optimal threshold for ROC AUC is " + str(optimal_threshold_auc_roc))
The optimal threshold for ROC AUC is 0.10923215086497905
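The `np.argmax(tpr - fpr)` step picks the threshold that maximizes Youden's J statistic, i.e. the point on the ROC curve farthest above the chance diagonal. The sketch below spells out the same criterion by hand on toy data; the function name `youden_threshold` and the arrays are hypothetical and only serve the illustration.

```python
import numpy as np

# Illustrative labels and predicted probabilities (not the loan data)
y_true = np.array([0, 0, 0, 1, 0, 1, 1])
y_prob = np.array([0.05, 0.10, 0.20, 0.30, 0.40, 0.70, 0.90])


def youden_threshold(y_true, y_prob):
    """Return the candidate threshold with the largest J = TPR - FPR."""
    best_t, best_j = None, -1.0
    n_pos = (y_true == 1).sum()
    n_neg = (y_true == 0).sum()
    for t in np.unique(y_prob):
        pred = y_prob >= t  # predict class 1 at or above the threshold
        tpr = (pred & (y_true == 1)).sum() / n_pos
        fpr = (pred & (y_true == 0)).sum() / n_neg
        j = tpr - fpr
        if j > best_j:
            best_t, best_j = t, j
    return best_t, best_j


threshold, j_stat = youden_threshold(y_true, y_prob)
print("threshold:", threshold, "J:", j_stat)
```

In the notebook this criterion yields a threshold of about 0.109, far below the default 0.5, so the tables that follow show higher recall and lower precision than the default model.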
#make confusion matrix
confusionMatrixWithThreshold(lg, X_train, y_train)
# checking model performance for this model with training data
logRegTrainPerf = check_model_performance_classification_with_threshold(lg, X_train, y_train)
print("Training performance:")
logRegTrainPerf
Training performance:
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.953429 | 0.616314 | 0.85 | 0.714536 |
# create confusion matrix on test data
confusionMatrixWithThreshold(lg, X_test, y_test)
# checking model performance for this model with test data
logRegTestPerf = check_model_performance_classification_with_threshold(lg, X_test, y_test)
print("Test data performance:")
logRegTestPerf
Test data performance:
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.943333 | 0.557047 | 0.813725 | 0.661355 |
# create confusion matrix training data with optimal Threshold
confusionMatrixWithThreshold(lg, X_train, y_train,threshold=optimal_threshold_auc_roc)
# checking model performance for this model with training data with optimal threshold
logRegTrainPerfThreshold_AucRoc = check_model_performance_classification_with_threshold(lg, X_train, y_train,threshold=optimal_threshold_auc_roc)
print("Training performance:")
logRegTrainPerfThreshold_AucRoc
Training performance:
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.883429 | 0.900302 | 0.442793 | 0.593625 |
# create confusion matrix test data with optimal Threshold
cm = confusionMatrixWithThreshold(lg, X_test, y_test,threshold=optimal_threshold_auc_roc)
# checking model performance for this model with test data with optimal threshold
logRegTestPerf_threshold_auc_roc = check_model_performance_classification_with_threshold(lg, X_test, y_test,threshold=optimal_threshold_auc_roc)
print("Test performance:")
logRegTestPerf_threshold_auc_roc
Test performance:
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.890667 | 0.865772 | 0.472527 | 0.611374 |
# Plot the precision-recall curve
y_scores = lg.predict_proba(X_train)[:, 1]
prec, rec, tre = precision_recall_curve(y_train, y_scores)
def plot_prec_recall_vs_tresh(precisions, recalls, thresholds):
    # precision and recall have one more entry than thresholds, so drop the last point
    plt.plot(thresholds, precisions[:-1], "y--", label="precision")
    plt.plot(thresholds, recalls[:-1], "p--", label="recall")
    plt.xlabel("Threshold")
    plt.legend(loc="upper left")
    plt.ylim([0, 1])
plt.figure(figsize=(10, 7))
plot_prec_recall_vs_tresh(prec, rec, tre)
plt.show()
# create confusion matrix on training data with the precision-recall threshold (0.35)
confusionMatrixWithThreshold(lg, X_train, y_train,.35)
# checking model performance on training data with the precision-recall threshold
logRegTrainPerf_threshold_P_Recall = check_model_performance_classification_with_threshold(lg, X_train, y_train,.35)
print("Training performance:")
logRegTrainPerf_threshold_P_Recall
Training performance:
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.948286 | 0.719033 | 0.730061 | 0.724505 |
Observation: As expected, recall (0.719) and precision (0.730) are very close on the training set, because the 0.35 threshold was chosen on the training data to balance the two.
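The 0.35 value is passed in as a literal after inspecting the precision-recall plot. As a hedged alternative, the crossover can also be located programmatically by taking the threshold where precision and recall are closest. The toy arrays and the name `balanced_threshold` below are assumptions for illustration only.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Illustrative labels and predicted probabilities (not the loan data)
y_true = np.array([0, 0, 0, 1, 0, 1, 0, 1, 1, 1])
y_prob = np.array([0.05, 0.10, 0.20, 0.30, 0.35, 0.50, 0.55, 0.70, 0.80, 0.90])

prec, rec, thr = precision_recall_curve(y_true, y_prob)

# prec and rec carry one extra trailing point with no threshold, so drop it
gap = np.abs(prec[:-1] - rec[:-1])
ix = int(np.argmin(gap))
balanced_threshold = thr[ix]

print("threshold:", balanced_threshold, "precision:", prec[ix], "recall:", rec[ix])
```

With the notebook's objects this would be applied to `prec`, `rec` and `tre` from the cell above; whether it lands exactly on 0.35 depends on the fitted model.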
# create confusion matrix on test data with the precision-recall threshold (0.35)
confusionMatrixWithThreshold(lg, X_test, y_test,threshold=.35)
# checking model performance on test data with the precision-recall threshold
logRegTestPerfThreshold_P_Recall = check_model_performance_classification_with_threshold(lg, X_test, y_test,threshold=.35)
print("The test performance:")
logRegTestPerfThreshold_P_Recall
The test performance:
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.937333 | 0.644295 | 0.70073 | 0.671329 |
Observation: On the test set, both recall (0.644) and precision (0.701) are somewhat lower than on the training set.
# Training data performance comparison
trainComparison_df = pd.concat(
[
logRegTrainPerf.T,
logRegTrainPerfThreshold_AucRoc.T,
logRegTrainPerf_threshold_P_Recall.T,
],
axis=1,
)
trainComparison_df.columns = [
"Logistic Regression sklearn",
"Logistic Regression with optimal ROC AUC threshold",
"Logistic Regression with optimal precision-recall threshold",
]
print("Training data set performance comparison:")
trainComparison_df
Training data set performance comparison:
| | Logistic Regression sklearn | Logistic Regression with optimal ROC AUC threshold | Logistic Regression with optimal precision-recall threshold |
|---|---|---|---|
| Accuracy | 0.953429 | 0.883429 | 0.948286 |
| Recall | 0.616314 | 0.900302 | 0.719033 |
| Precision | 0.850000 | 0.442793 | 0.730061 |
| F1 | 0.714536 | 0.593625 | 0.724505 |
# Test performance comparison
testComparison_df = pd.concat(
[
logRegTestPerf.T,
logRegTestPerf_threshold_auc_roc.T,
logRegTestPerfThreshold_P_Recall.T,
],
axis=1,
)
testComparison_df.columns = [
"Logistic Regression sklearn",
"Logistic Regression with optimal ROC AUC threshold",
"Logistic Regression with optimal precision-recall threshold",
]
print("Test data set performance comparison:")
testComparison_df
Test data set performance comparison:
| | Logistic Regression sklearn | Logistic Regression with optimal ROC AUC threshold | Logistic Regression with optimal precision-recall threshold |
|---|---|---|---|
| Accuracy | 0.943333 | 0.890667 | 0.937333 |
| Recall | 0.557047 | 0.865772 | 0.644295 |
| Precision | 0.813725 | 0.472527 | 0.700730 |
| F1 | 0.661355 | 0.611374 | 0.671329 |
# Building a new decision tree model and fitting it to the training data
loanTree = DecisionTreeClassifier(criterion = 'gini', random_state=1)
loanTree.fit(X_train, y_train)
DecisionTreeClassifier(random_state=1)
#printing accuracy after fitting
print("Accuracy on training set : ",loanTree.score(X_train, y_train))
print("Accuracy on test set : ",loanTree.score(X_test, y_test))
Accuracy on training set : 1.0
Accuracy on test set : 0.9793333333333333
# Function to calculate recall on the training and test sets
def get_recall_score(model):
    pred_train = model.predict(X_train)
    pred_test = model.predict(X_test)
    return [metrics.recall_score(y_train,pred_train),metrics.recall_score(y_test,pred_test)]
# Function that plots a confusion matrix annotated with counts and percentages
def generate_confusion_matrix(model, predictors, target):
    y_pred = model.predict(predictors)
    cm = confusion_matrix(target, y_pred)
    labels = np.asarray(
        [
            ["{0:0.0f}".format(item) + "\n{0:.2%}".format(item / cm.flatten().sum())]
            for item in cm.flatten()
        ]
    ).reshape(2, 2)
    plt.figure(figsize=(6, 4))
    sns.heatmap(cm, annot=labels, fmt="")
    plt.ylabel("True label")
    plt.xlabel("Predicted label")
#setup confusion matrix
generate_confusion_matrix(loanTree,X_test,y_test)
#variables to later compare
loanTree_acc_train = loanTree.score(X_train, y_train)
loanTree_acc_test = loanTree.score(X_test, y_test)
loanTree_recall = get_recall_score(loanTree)
# Accuracy on train and test
print("Accuracy on training data -- ",loanTree_acc_train)
print("Accuracy on test data -- ",loanTree_acc_test)
print("Recall on training data -- ",loanTree_recall[0])
print("Recall on test data -- ", loanTree_recall[1])
Accuracy on training data -- 1.0
Accuracy on test data -- 0.9793333333333333
Recall on training data -- 1.0
Recall on test data -- 0.8791946308724832
#print out the features so we can reuse in further code
feature_names = list(X.columns)
print(feature_names)
['const', 'Age', 'Income', 'Family', 'CCAvg', 'Education', 'Mortgage', 'Securities_Account', 'CD_Account', 'Online', 'CreditCard', 'City_Alameda', 'City_Alamo', 'City_Albany', 'City_Alhambra', 'City_Anaheim', 'City_Antioch', 'City_Aptos', 'City_Arcadia', 'City_Arcata', 'City_Bakersfield', 'City_Baldwin Park', 'City_Banning', 'City_Bella Vista', 'City_Belmont', 'City_Belvedere Tiburon', 'City_Ben Lomond', 'City_Berkeley', 'City_Beverly Hills', 'City_Bodega Bay', 'City_Bonita', 'City_Boulder Creek', 'City_Brea', 'City_Brisbane', 'City_Burlingame', 'City_Calabasas', 'City_Camarillo', 'City_Campbell', 'City_Canoga Park', 'City_Capistrano Beach', 'City_Capitola', 'City_Cardiff By The Sea', 'City_Carlsbad', 'City_Carpinteria', 'City_Carson', 'City_Castro Valley', 'City_Ceres', 'City_Chatsworth', 'City_Chico', 'City_Chino', 'City_Chino Hills', 'City_Chula Vista', 'City_Citrus Heights', 'City_Claremont', 'City_Clearlake', 'City_Clovis', 'City_Concord', 'City_Costa Mesa', 'City_Crestline', 'City_Culver City', 'City_Cupertino', 'City_Cypress', 'City_Daly City', 'City_Danville', 'City_Davis', 'City_Diamond Bar', 'City_Edwards', 'City_El Dorado Hills', 'City_El Segundo', 'City_El Sobrante', 'City_Elk Grove', 'City_Emeryville', 'City_Encinitas', 'City_Escondido', 'City_Eureka', 'City_Fairfield', 'City_Fallbrook', 'City_Fawnskin', 'City_Folsom', 'City_Fremont', 'City_Fresno', 'City_Fullerton', 'City_Garden Grove', 'City_Gilroy', 'City_Glendale', 'City_Glendora', 'City_Goleta', 'City_Greenbrae', 'City_Hacienda Heights', 'City_Half Moon Bay', 'City_Hawthorne', 'City_Hayward', 'City_Hermosa Beach', 'City_Highland', 'City_Hollister', 'City_Hopland', 'City_Huntington Beach', 'City_Imperial', 'City_Inglewood', 'City_Irvine', 'City_La Jolla', 'City_La Mesa', 'City_La Mirada', 'City_La Palma', 'City_Ladera Ranch', 'City_Laguna Hills', 'City_Laguna Niguel', 'City_Lake Forest', 'City_Larkspur', 'City_Livermore', 'City_Loma Linda', 'City_Lomita', 'City_Lompoc', 'City_Long Beach', 'City_Los 
Alamitos', 'City_Los Altos', 'City_Los Angeles', 'City_Los Gatos', 'City_Manhattan Beach', 'City_March Air Reserve Base', 'City_Marina', 'City_Martinez', 'City_Menlo Park', 'City_Merced', 'City_Milpitas', 'City_Mission Hills', 'City_Mission Viejo', 'City_Modesto', 'City_Monrovia', 'City_Montague', 'City_Montclair', 'City_Montebello', 'City_Monterey', 'City_Monterey Park', 'City_Moraga', 'City_Morgan Hill', 'City_Moss Landing', 'City_Mountain View', 'City_Napa', 'City_National City', 'City_Newbury Park', 'City_Newport Beach', 'City_North Hills', 'City_North Hollywood', 'City_Northridge', 'City_Norwalk', 'City_Novato', 'City_Oak View', 'City_Oakland', 'City_Oceanside', 'City_Ojai', 'City_Orange', 'City_Oxnard', 'City_Pacific Grove', 'City_Pacific Palisades', 'City_Palo Alto', 'City_Palos Verdes Peninsula', 'City_Pasadena', 'City_Placentia', 'City_Pleasant Hill', 'City_Pleasanton', 'City_Pomona', 'City_Porter Ranch', 'City_Portola Valley', 'City_Poway', 'City_Rancho Cordova', 'City_Rancho Cucamonga', 'City_Rancho Palos Verdes', 'City_Redding', 'City_Redlands', 'City_Redondo Beach', 'City_Redwood City', 'City_Reseda', 'City_Richmond', 'City_Ridgecrest', 'City_Rio Vista', 'City_Riverside', 'City_Rohnert Park', 'City_Rosemead', 'City_Roseville', 'City_Sacramento', 'City_Salinas', 'City_San Anselmo', 'City_San Bernardino', 'City_San Bruno', 'City_San Clemente', 'City_San Diego', 'City_San Dimas', 'City_San Francisco', 'City_San Gabriel', 'City_San Jose', 'City_San Juan Bautista', 'City_San Juan Capistrano', 'City_San Leandro', 'City_San Luis Obispo', 'City_San Luis Rey', 'City_San Marcos', 'City_San Mateo', 'City_San Pablo', 'City_San Rafael', 'City_San Ramon', 'City_San Ysidro', 'City_Sanger', 'City_Santa Ana', 'City_Santa Barbara', 'City_Santa Clara', 'City_Santa Clarita', 'City_Santa Cruz', 'City_Santa Monica', 'City_Santa Rosa', 'City_Santa Ynez', 'City_Saratoga', 'City_Sausalito', 'City_Seal Beach', 'City_Seaside', 'City_Sherman Oaks', 'City_Sierra Madre', 
'City_Signal Hill', 'City_Simi Valley', 'City_Sonora', 'City_South Gate', 'City_South Lake Tahoe', 'City_South Pasadena', 'City_South San Francisco', 'City_Stanford', 'City_Stinson Beach', 'City_Stockton', 'City_Studio City', 'City_Sunland', 'City_Sunnyvale', 'City_Sylmar', 'City_Tahoe City', 'City_Tehachapi', 'City_Thousand Oaks', 'City_Torrance', 'City_Trinity Center', 'City_Tustin', 'City_Ukiah', 'City_Upland', 'City_Valencia', 'City_Vallejo', 'City_Van Nuys', 'City_Venice', 'City_Ventura', 'City_Vista', 'City_Walnut Creek', 'City_Weed', 'City_West Covina', 'City_West Sacramento', 'City_Westlake Village', 'City_Whittier', 'City_Woodland Hills', 'City_Yorba Linda', 'City_Yucaipa', 'County_Butte County', 'County_Contra Costa County', 'County_El Dorado County', 'County_Fresno County', 'County_Humboldt County', 'County_Imperial County', 'County_Kern County', 'County_Lake County', 'County_Los Angeles County', 'County_Marin County', 'County_Mendocino County', 'County_Merced County', 'County_Monterey County', 'County_Napa County', 'County_Orange County', 'County_Placer County', 'County_Riverside County', 'County_Sacramento County', 'County_San Benito County', 'County_San Bernardino County', 'County_San Diego County', 'County_San Francisco County', 'County_San Joaquin County', 'County_San Luis Obispo County', 'County_San Mateo County', 'County_Santa Barbara County', 'County_Santa Clara County', 'County_Santa Cruz County', 'County_Shasta County', 'County_Siskiyou County', 'County_Solano County', 'County_Sonoma County', 'County_Stanislaus County', 'County_Trinity County', 'County_Tuolumne County', 'County_Ventura County', 'County_Yolo County']
#Visualize tree
plt.figure(figsize=(20,30))
tree.plot_tree(loanTree,feature_names=feature_names,filled=True,fontsize=9,node_ids=True,class_names=True)
plt.show()
# Text report showing the rules of a decision tree -
print(tree.export_text(loanTree,feature_names=feature_names,show_weights=True))
|--- Income <= 116.50
|   |--- CCAvg <= 2.95
|   |   |--- Income <= 106.50
|   |   |   |--- weights: [2553.00, 0.00] class: 0
|   |   |--- Income > 106.50
|   |   |   |--- Family <= 3.50
|   |   |   |   |--- City_Cardiff By The Sea <= 0.50
|   |   |   |   |   |--- City_Santa Barbara <= 0.50
|   |   |   |   |   |   |--- Age <= 28.50
|   |   |   |   |   |   |   |--- Education <= 1.50
|   |   |   |   |   |   |   |   |--- weights: [5.00, 0.00] class: 0
|   |   |   |   |   |   |   |--- Education > 1.50
|   |   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |   |--- Age > 28.50
|   |   |   |   |   |   |   |--- weights: [58.00, 0.00] class: 0
|   |   |   |   |   |--- City_Santa Barbara > 0.50
|   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |--- City_Cardiff By The Sea > 0.50
|   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |--- Family > 3.50
|   |   |   |   |--- Age <= 32.50
|   |   |   |   |   |--- County_San Diego County <= 0.50
|   |   |   |   |   |   |--- weights: [12.00, 0.00] class: 0
|   |   |   |   |   |--- County_San Diego County > 0.50
|   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |--- Age > 32.50
|   |   |   |   |   |--- Age <= 60.00
|   |   |   |   |   |   |--- weights: [0.00, 6.00] class: 1
|   |   |   |   |   |--- Age > 60.00
|   |   |   |   |   |   |--- weights: [4.00, 0.00] class: 0
|   |--- CCAvg > 2.95
|   |   |--- Income <= 92.50
|   |   |   |--- CD_Account <= 0.50
|   |   |   |   |--- County_Riverside County <= 0.50
|   |   |   |   |   |--- City_Glendale <= 0.50
|   |   |   |   |   |   |--- Age <= 26.50
|   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |   |--- Age > 26.50
|   |   |   |   |   |   |   |--- City_San Francisco <= 0.50
|   |   |   |   |   |   |   |   |--- City_Whittier <= 0.50
|   |   |   |   |   |   |   |   |   |--- Age <= 62.50
|   |   |   |   |   |   |   |   |   |   |--- City_Stanford <= 0.50
|   |   |   |   |   |   |   |   |   |   |   |--- truncated branch of depth 4
|   |   |   |   |   |   |   |   |   |   |--- City_Stanford > 0.50
|   |   |   |   |   |   |   |   |   |   |   |--- truncated branch of depth 2
|   |   |   |   |   |   |   |   |   |--- Age > 62.50
|   |   |   |   |   |   |   |   |   |   |--- Income <= 84.00
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |   |   |   |   |   |--- Income > 84.00
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [2.00, 0.00] class: 0
|   |   |   |   |   |   |   |   |--- City_Whittier > 0.50
|   |   |   |   |   |   |   |   |   |--- Income <= 71.00
|   |   |   |   |   |   |   |   |   |   |--- weights: [1.00, 0.00] class: 0
|   |   |   |   |   |   |   |   |   |--- Income > 71.00
|   |   |   |   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |   |   |--- City_San Francisco > 0.50
|   |   |   |   |   |   |   |   |--- Income <= 82.50
|   |   |   |   |   |   |   |   |   |--- weights: [4.00, 0.00] class: 0
|   |   |   |   |   |   |   |   |--- Income > 82.50
|   |   |   |   |   |   |   |   |   |--- weights: [0.00, 2.00] class: 1
|   |   |   |   |   |--- City_Glendale > 0.50
|   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |--- County_Riverside County > 0.50
|   |   |   |   |   |--- Family <= 2.50
|   |   |   |   |   |   |--- weights: [0.00, 2.00] class: 1
|   |   |   |   |   |--- Family > 2.50
|   |   |   |   |   |   |--- weights: [1.00, 0.00] class: 0
|   |   |   |--- CD_Account > 0.50
|   |   |   |   |--- weights: [0.00, 5.00] class: 1
|   |   |--- Income > 92.50
|   |   |   |--- Education <= 1.50
|   |   |   |   |--- CD_Account <= 0.50
|   |   |   |   |   |--- Family <= 3.50
|   |   |   |   |   |   |--- City_Irvine <= 0.50
|   |   |   |   |   |   |   |--- City_Los Angeles <= 0.50
|   |   |   |   |   |   |   |   |--- weights: [27.00, 0.00] class: 0
|   |   |   |   |   |   |   |--- City_Los Angeles > 0.50
|   |   |   |   |   |   |   |   |--- Income <= 102.00
|   |   |   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |   |   |   |--- Income > 102.00
|   |   |   |   |   |   |   |   |   |--- weights: [4.00, 0.00] class: 0
|   |   |   |   |   |   |--- City_Irvine > 0.50
|   |   |   |   |   |   |   |--- Income <= 109.50
|   |   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |   |   |--- Income > 109.50
|   |   |   |   |   |   |   |   |--- weights: [1.00, 0.00] class: 0
|   |   |   |   |   |--- Family > 3.50
|   |   |   |   |   |   |--- CCAvg <= 4.20
|   |   |   |   |   |   |   |--- weights: [0.00, 2.00] class: 1
|   |   |   |   |   |   |--- CCAvg > 4.20
|   |   |   |   |   |   |   |--- weights: [1.00, 0.00] class: 0
|   |   |   |   |--- CD_Account > 0.50
|   |   |   |   |   |--- Income <= 93.50
|   |   |   |   |   |   |--- weights: [1.00, 0.00] class: 0
|   |   |   |   |   |--- Income > 93.50
|   |   |   |   |   |   |--- weights: [0.00, 5.00] class: 1
|   |   |   |--- Education > 1.50
|   |   |   |   |--- Age <= 63.50
|   |   |   |   |   |--- Mortgage <= 172.00
|   |   |   |   |   |   |--- CD_Account <= 0.50
|   |   |   |   |   |   |   |--- Age <= 60.50
|   |   |   |   |   |   |   |   |--- weights: [0.00, 21.00] class: 1
|   |   |   |   |   |   |   |--- Age > 60.50
|   |   |   |   |   |   |   |   |--- CCAvg <= 3.75
|   |   |   |   |   |   |   |   |   |--- weights: [2.00, 0.00] class: 0
|   |   |   |   |   |   |   |   |--- CCAvg > 3.75
|   |   |   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |   |--- CD_Account > 0.50
|   |   |   |   |   |   |   |--- Income <= 98.50
|   |   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |   |   |--- Income > 98.50
|   |   |   |   |   |   |   |   |--- weights: [2.00, 0.00] class: 0
|   |   |   |   |   |--- Mortgage > 172.00
|   |   |   |   |   |   |--- Family <= 2.50
|   |   |   |   |   |   |   |--- Income <= 100.00
|   |   |   |   |   |   |   |   |--- weights: [5.00, 0.00] class: 0
|   |   |   |   |   |   |   |--- Income > 100.00
|   |   |   |   |   |   |   |   |--- weights: [0.00, 2.00] class: 1
|   |   |   |   |   |   |--- Family > 2.50
|   |   |   |   |   |   |   |--- weights: [0.00, 3.00] class: 1
|   |   |   |   |--- Age > 63.50
|   |   |   |   |   |--- weights: [2.00, 0.00] class: 0
|--- Income > 116.50
|   |--- Education <= 1.50
|   |   |--- Family <= 2.50
|   |   |   |--- weights: [375.00, 0.00] class: 0
|   |   |--- Family > 2.50
|   |   |   |--- weights: [0.00, 47.00] class: 1
|   |--- Education > 1.50
|   |   |--- weights: [0.00, 222.00] class: 1
# Show important features ranked, convert to markdown
print (pd.DataFrame(loanTree.feature_importances_, columns = ["Imp"], index = X_train.columns).sort_values(by = 'Imp', ascending = False).to_markdown())
| | Imp | |:------------------------------|------------:| | Education | 0.406402 | | Income | 0.320267 | | Family | 0.15404 | | CCAvg | 0.0432462 | | CD_Account | 0.0257115 | | Age | 0.0251519 | | County_Riverside County | 0.00354378 | | City_Santa Barbara | 0.00318351 | | City_Cardiff By The Sea | 0.00308704 | | County_San Diego County | 0.00308004 | | Mortgage | 0.00301439 | | City_Glendale | 0.00294379 | | City_San Francisco | 0.0017001 | | City_Whittier | 0.00147154 | | City_Irvine | 0.00138007 | | City_Berkeley | 0.000635863 | | City_Stanford | 0.000578578 | | City_Los Angeles | 0.000563069 | | City_San Pablo | 0 | | City_Sierra Madre | 0 | | City_Sherman Oaks | 0 | | City_San Gabriel | 0 | | City_Seaside | 0 | | City_San Jose | 0 | | City_San Juan Bautista | 0 | | City_San Juan Capistrano | 0 | | City_San Leandro | 0 | | City_San Luis Obispo | 0 | | City_San Luis Rey | 0 | | City_San Marcos | 0 | | City_San Mateo | 0 | | City_Santa Rosa | 0 | | City_Santa Monica | 0 | | City_San Ramon | 0 | | City_San Ysidro | 0 | | City_Seal Beach | 0 | | City_Sausalito | 0 | | City_Saratoga | 0 | | City_Sanger | 0 | | City_Santa Ana | 0 | | City_San Dimas | 0 | | City_Santa Ynez | 0 | | City_Santa Clara | 0 | | City_Santa Clarita | 0 | | City_Santa Cruz | 0 | | City_San Rafael | 0 | | City_San Anselmo | 0 | | City_San Diego | 0 | | City_Pomona | 0 | | City_Rancho Palos Verdes | 0 | | City_Rancho Cucamonga | 0 | | City_Rancho Cordova | 0 | | City_Poway | 0 | | City_Portola Valley | 0 | | City_Porter Ranch | 0 | | City_Pleasanton | 0 | | City_San Clemente | 0 | | City_Pleasant Hill | 0 | | City_Placentia | 0 | | City_Pasadena | 0 | | City_Palos Verdes Peninsula | 0 | | City_Palo Alto | 0 | | City_Pacific Palisades | 0 | | City_Redding | 0 | | City_Redlands | 0 | | City_Redondo Beach | 0 | | City_Redwood City | 0 | | City_Reseda | 0 | | City_Richmond | 0 | | City_Ridgecrest | 0 | | City_Rio Vista | 0 | | City_Riverside | 0 | | City_Rohnert Park | 0 | | City_Rosemead | 0 | | 
City_Roseville | 0 | | City_Sacramento | 0 | | City_Salinas | 0 | | City_Simi Valley | 0 | | City_San Bernardino | 0 | | City_San Bruno | 0 | | City_Signal Hill | 0 | | const | 0 | | City_Sonora | 0 | | County_Mendocino County | 0 | | County_Sacramento County | 0 | | County_Placer County | 0 | | County_Orange County | 0 | | County_Napa County | 0 | | County_Monterey County | 0 | | County_Merced County | 0 | | County_Marin County | 0 | | County_Contra Costa County | 0 | | County_Los Angeles County | 0 | | County_Lake County | 0 | | County_Kern County | 0 | | County_Imperial County | 0 | | County_Humboldt County | 0 | | County_Fresno County | 0 | | County_San Benito County | 0 | | County_San Bernardino County | 0 | | County_San Francisco County | 0 | | County_San Joaquin County | 0 | | County_San Luis Obispo County | 0 | | County_San Mateo County | 0 | | County_Santa Barbara County | 0 | | County_Santa Clara County | 0 | | County_Santa Cruz County | 0 | | County_Shasta County | 0 | | County_Siskiyou County | 0 | | County_Solano County | 0 | | County_Sonoma County | 0 | | County_Stanislaus County | 0 | | County_Trinity County | 0 | | County_Tuolumne County | 0 | | County_Ventura County | 0 | | County_El Dorado County | 0 | | County_Butte County | 0 | | City_South Gate | 0 | | City_Sunland | 0 | | City_Torrance | 0 | | City_Thousand Oaks | 0 | | City_Tehachapi | 0 | | City_Tahoe City | 0 | | City_Sylmar | 0 | | City_Sunnyvale | 0 | | City_Studio City | 0 | | City_Yucaipa | 0 | | City_Stockton | 0 | | City_Stinson Beach | 0 | | City_Oxnard | 0 | | City_South San Francisco | 0 | | City_South Pasadena | 0 | | City_South Lake Tahoe | 0 | | City_Trinity Center | 0 | | City_Tustin | 0 | | City_Ukiah | 0 | | City_Upland | 0 | | City_Valencia | 0 | | City_Vallejo | 0 | | City_Van Nuys | 0 | | City_Venice | 0 | | City_Ventura | 0 | | City_Vista | 0 | | City_Walnut Creek | 0 | | City_Weed | 0 | | City_West Covina | 0 | | City_West Sacramento | 0 | | City_Westlake Village | 0 | | 
City_Woodland Hills | 0 | | City_Yorba Linda | 0 | | City_Pacific Grove | 0 | | City_Norwalk | 0 | | City_Orange | 0 | | City_Citrus Heights | 0 | | City_Crestline | 0 | | City_Costa Mesa | 0 | | City_Concord | 0 | | City_Clovis | 0 | | City_Clearlake | 0 | | City_Claremont | 0 | | City_Chula Vista | 0 | | City_Carpinteria | 0 | | City_Chino Hills | 0 | | City_Chino | 0 | | City_Chico | 0 | | City_Chatsworth | 0 | | City_Ceres | 0 | | City_Castro Valley | 0 | | City_Culver City | 0 | | City_Cupertino | 0 | | City_Cypress | 0 | | City_Daly City | 0 | | City_Danville | 0 | | City_Davis | 0 | | City_Diamond Bar | 0 | | City_Edwards | 0 | | City_El Dorado Hills | 0 | | City_El Segundo | 0 | | City_El Sobrante | 0 | | City_Elk Grove | 0 | | City_Emeryville | 0 | | City_Encinitas | 0 | | City_Escondido | 0 | | City_Eureka | 0 | | City_Fairfield | 0 | | City_Carson | 0 | | City_Carlsbad | 0 | | City_Fawnskin | 0 | | City_Anaheim | 0 | | City_Baldwin Park | 0 | | City_Bakersfield | 0 | | City_Arcata | 0 | | City_Arcadia | 0 | | City_Aptos | 0 | | City_Antioch | 0 | | City_Alhambra | 0 | | City_Capitola | 0 | | City_Albany | 0 | | City_Alamo | 0 | | City_Alameda | 0 | | CreditCard | 0 | | Online | 0 | | Securities_Account | 0 | | City_Banning | 0 | | City_Bella Vista | 0 | | City_Belmont | 0 | | City_Belvedere Tiburon | 0 | | City_Ben Lomond | 0 | | City_Beverly Hills | 0 | | City_Bodega Bay | 0 | | City_Bonita | 0 | | City_Boulder Creek | 0 | | City_Brea | 0 | | City_Brisbane | 0 | | City_Burlingame | 0 | | City_Calabasas | 0 | | City_Camarillo | 0 | | City_Campbell | 0 | | City_Canoga Park | 0 | | City_Capistrano Beach | 0 | | City_Fallbrook | 0 | | City_Folsom | 0 | | City_Ojai | 0 | | City_Mission Hills | 0 | | City_Montebello | 0 | | City_Montclair | 0 | | City_Montague | 0 | | City_Monrovia | 0 | | City_Modesto | 0 | | City_Mission Viejo | 0 | | City_Milpitas | 0 | | City_Los Altos | 0 | | City_Merced | 0 | | City_Menlo Park | 0 | | City_Martinez | 0 | | City_Marina | 
0 | | City_March Air Reserve Base | 0 | | City_Manhattan Beach | 0 | | City_Monterey | 0 | | City_Monterey Park | 0 | | City_Moraga | 0 | | City_Morgan Hill | 0 | | City_Moss Landing | 0 | | City_Mountain View | 0 | | City_Napa | 0 | | City_National City | 0 | | City_Newbury Park | 0 | | City_Newport Beach | 0 | | City_North Hills | 0 | | City_North Hollywood | 0 | | City_Northridge | 0 | | City_Novato | 0 | | City_Oak View | 0 | | City_Oakland | 0 | | City_Oceanside | 0 | | City_Los Gatos | 0 | | City_Los Alamitos | 0 | | City_Fremont | 0 | | City_Hacienda Heights | 0 | | City_Hollister | 0 | | City_Highland | 0 | | City_Hermosa Beach | 0 | | City_Hayward | 0 | | City_Hawthorne | 0 | | City_Half Moon Bay | 0 | | City_Greenbrae | 0 | | City_Long Beach | 0 | | City_Goleta | 0 | | City_Glendora | 0 | | City_Gilroy | 0 | | City_Garden Grove | 0 | | City_Fullerton | 0 | | City_Fresno | 0 | | City_Hopland | 0 | | City_Huntington Beach | 0 | | City_Imperial | 0 | | City_Inglewood | 0 | | City_La Jolla | 0 | | City_La Mesa | 0 | | City_La Mirada | 0 | | City_La Palma | 0 | | City_Ladera Ranch | 0 | | City_Laguna Hills | 0 | | City_Laguna Niguel | 0 | | City_Lake Forest | 0 | | City_Larkspur | 0 | | City_Livermore | 0 | | City_Loma Linda | 0 | | City_Lomita | 0 | | City_Lompoc | 0 | | County_Yolo County | 0 |
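Most rows in the importance table are zero because the tree never splits on those dummy columns. A small helper can keep only the features the tree actually used. This is a sketch: `nonzero_importances` is a hypothetical name, and the numbers below are rounded illustrative values rather than a rerun of the model.

```python
import pandas as pd


def nonzero_importances(importances, feature_names):
    """Return features with importance > 0, sorted from most to least important."""
    imp = pd.Series(importances, index=feature_names, name="Imp")
    return imp[imp > 0].sort_values(ascending=False)


# Illustrative values only; in the notebook the call would presumably be
# nonzero_importances(loanTree.feature_importances_, X_train.columns)
demo = nonzero_importances(
    [0.0, 0.32, 0.41, 0.0, 0.15],
    ["const", "Income", "Education", "Online", "Family"],
)
print(demo)
```

Filtering this way leaves the ranking unchanged and only hides the unused columns.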
# pre-prune and refit model
loanTree_prePruned = DecisionTreeClassifier(criterion = 'gini',max_depth=3,random_state=1)
loanTree_prePruned.fit(X_train, y_train)
DecisionTreeClassifier(max_depth=3, random_state=1)
# Confusion matrix for the tree limited to a depth of 3
generate_confusion_matrix(loanTree_prePruned,X_test,y_test)
#variables to later compare
loanTree_prePruned_acc_train = loanTree_prePruned.score(X_train, y_train)
loanTree_prePruned_acc_test = loanTree_prePruned.score(X_test, y_test)
loanTree_prePruned_recall = get_recall_score(loanTree_prePruned)
# Accuracy on train and test
print("Accuracy on training data -- ",loanTree_prePruned_acc_train)
print("Accuracy on test data -- ",loanTree_prePruned_acc_test)
print("Recall on training data -- ",loanTree_prePruned_recall[0])
print("Recall on test data -- ", loanTree_prePruned_recall[1])
Accuracy on training data -- 0.9822857142857143
Accuracy on test data -- 0.9753333333333334
Recall on training data -- 0.8126888217522659
Recall on test data -- 0.7516778523489933
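The unrestricted tree reaches a training recall of 1.0 but about 0.88 on the test set, while the depth-3 tree scores about 0.81 and 0.75, a smaller train/test gap at a lower overall level. The sketch below reproduces that kind of comparison on synthetic data from `make_classification`. The dataset, its parameters and the `results` dictionary are assumptions for illustration, so the numbers will differ from the loan data.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in for the loan data (roughly 10% positives)
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=10, n_informative=4,
    weights=[0.9, 0.1], flip_y=0.05, random_state=1,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=1
)

results = {}
for name, depth in [("full", None), ("max_depth=3", 3)]:
    clf = DecisionTreeClassifier(criterion="gini", max_depth=depth, random_state=1)
    clf.fit(X_tr, y_tr)
    results[name] = {
        "depth": clf.get_depth(),
        "train_recall": recall_score(y_tr, clf.predict(X_tr)),
        "test_recall": recall_score(y_te, clf.predict(X_te)),
    }

for name, row in results.items():
    print(name, row)
```

A fully grown tree memorizes the training rows, so its training recall says little about generalization; the test recall and the train/test gap are the numbers to compare across depths.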
# visualize tree
plt.figure(figsize=(15,10))
tree.plot_tree(loanTree_prePruned,feature_names=feature_names,filled=True,fontsize=9,node_ids=True,class_names=True)
plt.show()
#show feature importance
print (pd.DataFrame(loanTree_prePruned.feature_importances_, columns = ["Imp"], index = X_train.columns).sort_values(by = 'Imp', ascending = False).to_markdown())
| | Imp |
|:---|---:|
| Education | 0.446593 |
| Income | 0.346997 |
| Family | 0.162372 |
| CCAvg | 0.0440379 |
| All other features (const, Age, Mortgage, Securities_Account, CD_Account, Online, CreditCard, and every City_* and County_* dummy column) | 0 |
Observation: Only 4 features now have non-zero importance, because the depth-restricted tree does not grow deep enough to split on (and capture information from) any other feature.
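The link between tree depth and the number of features with non-zero importance can be checked on a small example. The sketch below is only an illustration: it uses synthetic data from `make_classification` (an assumption, not the bank dataset) and shows that a feature can only receive importance if it appears in at least one split.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; not the bank dataset).
X, y = make_classification(
    n_samples=500, n_features=20, n_informative=5, random_state=1
)

# A depth-2 tree has at most 3 internal nodes, so at most 3 features can be used.
shallow = DecisionTreeClassifier(max_depth=2, random_state=1).fit(X, y)

# Leaf nodes carry a negative feature index, so keep only real splits.
used = np.unique(shallow.tree_.feature[shallow.tree_.feature >= 0])
nonzero = np.flatnonzero(shallow.feature_importances_)

print("features used in splits:", used)
print("features with non-zero importance:", nonzero)
```

Every feature outside `used` gets an importance of exactly 0, which is the pattern seen in the table above.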
# Choose the type of classifier.
estimator = DecisionTreeClassifier(random_state=1)
# Grid of parameters to choose from
parameters = {
    'max_depth': np.arange(1, 10),
    'min_samples_leaf': [1, 2, 5, 7, 10, 15, 20],
    'max_leaf_nodes': [2, 3, 5, 10],
    'min_impurity_decrease': [0.001, 0.01, 0.1],
}
# Score parameter combinations by recall (not accuracy)
recall_scorer = metrics.make_scorer(metrics.recall_score)
# Run the grid search
grid_obj = GridSearchCV(estimator, parameters, scoring=recall_scorer, cv=5)
grid_obj = grid_obj.fit(X_train, y_train)
# Set the clf to the best combination of parameters
loanTree_grid = grid_obj.best_estimator_
# Fit the best estimator to the training data
# (GridSearchCV already refits it when refit=True, the default)
loanTree_grid.fit(X_train, y_train)
DecisionTreeClassifier(max_depth=5, max_leaf_nodes=10,
min_impurity_decrease=0.001, random_state=1)
# Confusion matrix for grid search
cm = generate_confusion_matrix(loanTree_grid,X_test,y_test)
Observation: The Type II error (false negative) rate is slightly lower, which is a good indication of improvement.
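For reference, the Type II error rate and recall are two views of the same confusion-matrix row. The sketch below uses a handful of made-up labels (hypothetical, not taken from the bank data) to show how both are derived from the false negative count.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, recall_score

# Toy labels (hypothetical): 1 = accepted the loan, 0 = did not.
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 1, 0, 0, 0, 0, 0, 0, 1])

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

recall = tp / (tp + fn)              # share of actual buyers the model finds
type_2_error_rate = fn / (tp + fn)   # share of actual buyers the model misses

print(recall, type_2_error_rate)
```

Because the two rates sum to 1, lowering the Type II error rate is the same as raising recall.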
#variables to later compare
loanTree_grid_acc_train = loanTree_grid.score(X_train, y_train)
loanTree_grid_acc_test = loanTree_grid.score(X_test, y_test)
loanTree_grid_recall = get_recall_score(loanTree_grid)
# Accuracy on train and test
print("Accuracy on training data -- ",loanTree_grid_acc_train)
print("Accuracy on test data -- ",loanTree_grid_acc_test)
print("Recall on training data -- ",loanTree_grid_recall[0])
print("Recall on test data -- ", loanTree_grid_recall[1])
Accuracy on training data --  0.9897142857142858
Accuracy on test data --  0.9813333333333333
Recall on training data --  0.9274924471299094
Recall on test data --  0.8791946308724832
Observation: Recall has improved to a high level (about 0.88 on the test data), and the number of false negatives is also quite low, which limits the loss of opportunity.
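The grid search used for this model optimizes recall directly. A compact sketch of the same idea follows, assuming synthetic imbalanced data from `make_classification` and a much smaller, made-up parameter grid; it is not the notebook's exact setup. It also shows that `best_estimator_` is already fitted when `refit=True` (the default), so a separate `fit` call is optional.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in data (an assumption; not the bank dataset).
X, y = make_classification(
    n_samples=600, n_features=10, weights=[0.9, 0.1], random_state=1
)

# A deliberately small, hypothetical grid to keep the example fast.
param_grid = {"max_depth": [2, 3, 5], "min_samples_leaf": [1, 5, 10]}

# scoring="recall" is the built-in equivalent of make_scorer(recall_score).
search = GridSearchCV(
    DecisionTreeClassifier(random_state=1),
    param_grid,
    scoring="recall",
    cv=5,
)
search.fit(X, y)

# With refit=True (the default), best_estimator_ is already refit on all of X, y.
best_tree = search.best_estimator_
print(search.best_params_, round(search.best_score_, 3))
```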
#visualize the decision tree into a tree plot
plt.figure(figsize=(15,10))
tree.plot_tree(loanTree_grid,feature_names=feature_names,filled=True,fontsize=9,node_ids=True,class_names=True)
plt.show()
# Printing a text report that defines the tree in text format
print(tree.export_text(loanTree_grid,feature_names=feature_names,show_weights=True))
|--- Income <= 116.50
|   |--- CCAvg <= 2.95
|   |   |--- weights: [2632.00, 10.00] class: 0
|   |--- CCAvg > 2.95
|   |   |--- Income <= 92.50
|   |   |   |--- CD_Account <= 0.50
|   |   |   |   |--- weights: [117.00, 10.00] class: 0
|   |   |   |--- CD_Account > 0.50
|   |   |   |   |--- weights: [0.00, 5.00] class: 1
|   |   |--- Income > 92.50
|   |   |   |--- Education <= 1.50
|   |   |   |   |--- CD_Account <= 0.50
|   |   |   |   |   |--- weights: [33.00, 4.00] class: 0
|   |   |   |   |--- CD_Account > 0.50
|   |   |   |   |   |--- weights: [1.00, 5.00] class: 1
|   |   |   |--- Education > 1.50
|   |   |   |   |--- weights: [11.00, 28.00] class: 1
|--- Income > 116.50
|   |--- Education <= 1.50
|   |   |--- Family <= 2.50
|   |   |   |--- weights: [375.00, 0.00] class: 0
|   |   |--- Family > 2.50
|   |   |   |--- weights: [0.00, 47.00] class: 1
|   |--- Education > 1.50
|   |   |--- weights: [0.00, 222.00] class: 1
#Printing the Feature importance
print (pd.DataFrame(loanTree_grid.feature_importances_, columns = ["Imp"], index = X_train.columns).sort_values(by = 'Imp', ascending = False).to_markdown())
| | Imp |
|:---|---:|
| Education | 0.447999 |
| Income | 0.328713 |
| Family | 0.155711 |
| CCAvg | 0.0422313 |
| CD_Account | 0.0253454 |
| All other features (const, Age, Mortgage, Securities_Account, Online, CreditCard, and every City_* and County_* dummy column) | 0 |
# Setup decision classifier
clf = DecisionTreeClassifier(random_state=1)
path = clf.cost_complexity_pruning_path(X_train, y_train)
ccp_alphas, impurities = path.ccp_alphas, path.impurities
#show alphas and impurities
pd.DataFrame(path)
| | ccp_alphas | impurities |
|---|---|---|
| 0 | 0.000000 | 0.000000 |
| 1 | 0.000189 | 0.000566 |
| 2 | 0.000269 | 0.001642 |
| 3 | 0.000273 | 0.003283 |
| 4 | 0.000281 | 0.003845 |
| 5 | 0.000381 | 0.004226 |
| 6 | 0.000381 | 0.004607 |
| 7 | 0.000381 | 0.004988 |
| 8 | 0.000381 | 0.005369 |
| 9 | 0.000476 | 0.005845 |
| 10 | 0.000517 | 0.007915 |
| 11 | 0.000527 | 0.008442 |
| 12 | 0.000537 | 0.009516 |
| 13 | 0.000582 | 0.010098 |
| 14 | 0.000593 | 0.011283 |
| 15 | 0.000607 | 0.011890 |
| 16 | 0.000641 | 0.014456 |
| 17 | 0.000882 | 0.017985 |
| 18 | 0.001552 | 0.019536 |
| 19 | 0.002333 | 0.021869 |
| 20 | 0.003024 | 0.024893 |
| 21 | 0.003294 | 0.028187 |
| 22 | 0.006473 | 0.034659 |
| 23 | 0.023866 | 0.058525 |
| 24 | 0.056365 | 0.171255 |
# Plot impurity vs alpha
fig, ax = plt.subplots(figsize=(10,5))
ax.plot(ccp_alphas[:-1], impurities[:-1], marker='o', drawstyle="steps-post")
ax.set_xlabel("effective alpha")
ax.set_ylabel("total impurity of leaves")
ax.set_title("Total Impurity vs effective alpha for training set")
plt.show()
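Observation: As alpha increases, the total impurity of the leaves rises very quickly at first, then levels off as the tree generalizes better, then rises again. This makes sense: a larger alpha prunes more of the tree, so fewer splits (and fewer features) remain to separate the classes.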
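The relationship between alpha and impurity can be verified numerically. The following is a minimal sketch on synthetic data (an assumption, not the bank dataset): each step of the pruning path removes the weakest remaining subtree, so both the effective alphas and the total leaf impurities should be non-decreasing along the path.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; not the bank dataset).
X, y = make_classification(n_samples=400, n_features=8, random_state=1)

path = DecisionTreeClassifier(random_state=1).cost_complexity_pruning_path(X, y)

# A small tolerance guards against floating-point noise in the comparisons.
alphas_sorted = bool(np.all(np.diff(path.ccp_alphas) >= -1e-12))
impurities_sorted = bool(np.all(np.diff(path.impurities) >= -1e-12))

print("alphas non-decreasing:", alphas_sorted)
print("impurities non-decreasing:", impurities_sorted)
```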
# Fit one decision tree per candidate ccp_alpha
clfs = []
for ccp_alpha in ccp_alphas:
    clf = DecisionTreeClassifier(random_state=1, ccp_alpha=ccp_alpha)
    clf.fit(X_train, y_train)
    clfs.append(clf)
print("Number of nodes in the last tree is: {} with ccp_alpha: {}".format(
    clfs[-1].tree_.node_count, ccp_alphas[-1]))
Number of nodes in the last tree is: 1 with ccp_alpha: 0.056364969335601575
# Plot nodes and depth vs alpha
clfs = clfs[:-1]
ccp_alphas = ccp_alphas[:-1]
node_counts = [clf.tree_.node_count for clf in clfs]
depth = [clf.tree_.max_depth for clf in clfs]
fig, ax = plt.subplots(2, 1,figsize=(10,7))
ax[0].plot(ccp_alphas, node_counts, marker='o', drawstyle="steps-post")
ax[0].set_xlabel("alpha")
ax[0].set_ylabel("number of nodes")
ax[0].set_title("Number of nodes vs alpha")
ax[1].plot(ccp_alphas, depth, marker='o', drawstyle="steps-post")
ax[1].set_xlabel("alpha")
ax[1].set_ylabel("depth of tree")
ax[1].set_title("Depth vs alpha")
fig.tight_layout()
Observation: As alpha increases, both the number of nodes and the depth of the tree decrease, as shown above.
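The same shrinking effect can be reproduced without plotting. This sketch assumes synthetic data and a few hand-picked alpha values (both hypothetical); it fits one tree per alpha and records the node count and depth.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; not the bank dataset).
X, y = make_classification(n_samples=400, n_features=8, random_state=1)

sizes = []
for alpha in [0.0, 0.005, 0.02, 0.1]:  # hand-picked, increasing alphas
    clf = DecisionTreeClassifier(random_state=1, ccp_alpha=alpha).fit(X, y)
    sizes.append((alpha, clf.tree_.node_count, clf.get_depth()))

for alpha, nodes, depth in sizes:
    print(f"alpha={alpha}: nodes={nodes}, depth={depth}")
```

Since every pruned tree is a subtree of the unpruned one, the node count never grows as alpha increases.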
# define classifier scores
train_scores = [clf.score(X_train, y_train) for clf in clfs]
test_scores = [clf.score(X_test, y_test) for clf in clfs]
# Plot accuracy vs alpha
fig, ax = plt.subplots(figsize=(10,5))
ax.set_xlabel("alpha")
ax.set_ylabel("accuracy")
ax.set_title("Accuracy vs alpha for training and testing sets")
ax.plot(ccp_alphas, train_scores, marker='o', label="train",
drawstyle="steps-post")
ax.plot(ccp_alphas, test_scores, marker='o', label="test",
drawstyle="steps-post")
ax.legend()
plt.show()
Observation:
#prepare and display accuracy metrics
index_ar_model = np.argmax(test_scores)
ar_model = clfs[index_ar_model]
print(ar_model)
print("")
print('Training accuracy of ar_model --- ',ar_model.score(X_train, y_train))
print('Test accuracy of ar_model --- ',ar_model.score(X_test, y_test))
DecisionTreeClassifier(ccp_alpha=0.0006414326414326415, random_state=1)

Training accuracy of ar_model ---  0.9914285714285714
Test accuracy of ar_model ---  0.9826666666666667
Observation:
# calculate recall on train data
recall_train = []
for clf in clfs:
    pred_train3 = clf.predict(X_train)
    values_train = metrics.recall_score(y_train, pred_train3)
    recall_train.append(values_train)
# calculate recall on test data
recall_test = []
for clf in clfs:
    pred_test3 = clf.predict(X_test)
    values_test = metrics.recall_score(y_test, pred_test3)
    recall_test.append(values_test)
fig, ax = plt.subplots(figsize=(15,5))
ax.set_xlabel("alpha")
ax.set_ylabel("Recall")
ax.set_title("Recall vs alpha for training and testing sets")
ax.plot(ccp_alphas, recall_train, marker='o', label="train",
drawstyle="steps-post")
ax.plot(ccp_alphas, recall_test, marker='o', label="test",
drawstyle="steps-post")
ax.legend()
plt.show()
# Select the model with the highest test recall
index_ar_model = np.argmax(recall_test)
ar_model = clfs[index_ar_model]
print(ar_model)
DecisionTreeClassifier(ccp_alpha=0.0006414326414326415, random_state=1)
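One detail of this selection step is worth knowing: `np.argmax` returns the first index of the maximum, so if several alphas tie on test recall, the smallest alpha (the largest tree) is chosen. The sketch below uses made-up alphas and recall values (hypothetical, not the notebook's results) to show that behavior and one possible alternative that prefers the simplest tied tree.

```python
import numpy as np

# Hypothetical alphas and test recalls, with a tie for the best recall.
alphas = np.array([0.0000, 0.0005, 0.0010, 0.0020])
recall_test = np.array([0.85, 0.89, 0.89, 0.80])

# np.argmax returns the first index at which the maximum occurs.
first_best = int(np.argmax(recall_test))

# Alternative: among tied models, prefer the largest alpha (the smallest tree).
tied = np.flatnonzero(recall_test == recall_test.max())
simplest_best = int(tied[-1])

print(alphas[first_best], alphas[simplest_best])
```

Selecting on test recall also means the test set takes part in model selection; a separate validation split or cross-validation would be a common way to avoid that, though this notebook does not use one.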
# Confusion matrix for the post-pruned model
generate_confusion_matrix(ar_model,X_test,y_test)
#variables to later compare
ar_model_acc_train = ar_model.score(X_train, y_train)
ar_model_acc_test = ar_model.score(X_test, y_test)
ar_model_recall = get_recall_score(ar_model)
# Accuracy on train and test
print("Accuracy on training set -- ",ar_model_acc_train)
print("Accuracy on test set -- ",ar_model_acc_test)
print("Recall on training set -- ",ar_model_recall[0])
print("Recall on test set -- ", ar_model_recall[1])
Accuracy on training set --  0.9914285714285714
Accuracy on test set --  0.9826666666666667
Recall on training set --  0.945619335347432
Recall on test set --  0.8926174496644296
#visualizing the tree with a tree plot
plt.figure(figsize=(17,15))
tree.plot_tree(ar_model,feature_names=feature_names,filled=True,fontsize=9,node_ids=True,class_names=True)
plt.show()
# Text report showing the rules of a decision tree -
print(tree.export_text(ar_model,feature_names=feature_names,show_weights=True))
|--- Income <= 116.50
|   |--- CCAvg <= 2.95
|   |   |--- Income <= 106.50
|   |   |   |--- weights: [2553.00, 0.00] class: 0
|   |   |--- Income > 106.50
|   |   |   |--- Family <= 3.50
|   |   |   |   |--- weights: [63.00, 3.00] class: 0
|   |   |   |--- Family > 3.50
|   |   |   |   |--- Age <= 32.50
|   |   |   |   |   |--- weights: [12.00, 1.00] class: 0
|   |   |   |   |--- Age > 32.50
|   |   |   |   |   |--- Age <= 60.00
|   |   |   |   |   |   |--- weights: [0.00, 6.00] class: 1
|   |   |   |   |   |--- Age > 60.00
|   |   |   |   |   |   |--- weights: [4.00, 0.00] class: 0
|   |--- CCAvg > 2.95
|   |   |--- Income <= 92.50
|   |   |   |--- CD_Account <= 0.50
|   |   |   |   |--- weights: [117.00, 10.00] class: 0
|   |   |   |--- CD_Account > 0.50
|   |   |   |   |--- weights: [0.00, 5.00] class: 1
|   |   |--- Income > 92.50
|   |   |   |--- Education <= 1.50
|   |   |   |   |--- CD_Account <= 0.50
|   |   |   |   |   |--- weights: [33.00, 4.00] class: 0
|   |   |   |   |--- CD_Account > 0.50
|   |   |   |   |   |--- weights: [1.00, 5.00] class: 1
|   |   |   |--- Education > 1.50
|   |   |   |   |--- weights: [11.00, 28.00] class: 1
|--- Income > 116.50
|   |--- Education <= 1.50
|   |   |--- Family <= 2.50
|   |   |   |--- weights: [375.00, 0.00] class: 0
|   |   |--- Family > 2.50
|   |   |   |--- weights: [0.00, 47.00] class: 1
|   |--- Education > 1.50
|   |   |--- weights: [0.00, 222.00] class: 1
#printing the features and the order of importance, set to markdown
print (pd.DataFrame(ar_model.feature_importances_, columns = ["Imp"], index = X_train.columns).sort_values(by = 'Imp', ascending = False).to_markdown())
| | Imp |
|:---|---:|
| Education | 0.437917 |
| Income | 0.325272 |
| Family | 0.156373 |
| CCAvg | 0.0412808 |
| CD_Account | 0.024775 |
| Age | 0.0143823 |
| All other features (const, Mortgage, Securities_Account, Online, CreditCard, and every City_* and County_* dummy column) | 0 |
# Build a dataframe with the recall of every model to support model selection
comparison_frame = pd.DataFrame(
    {
        'Model': [
            "Logistic Regression sklearn",
            "Logistic Regression with / ROC AUC thres",
            "Logistic Regression with / prec recall thres",
            'Initial decision tree model',
            'Decision tree with restricted maximum depth',
            'Decision tree with hyper-parameter tuning',
            'Decision tree with post-pruning',
        ],
        'Train_Recall': [
            logRegTrainPerf.loc[0][1],
            logRegTrainPerfThreshold_AucRoc.loc[0][1],
            logRegTrainPerf_threshold_P_Recall.loc[0][1],
            loanTree_recall[0],
            loanTree_prePruned_recall[0],
            loanTree_grid_recall[0],
            ar_model_recall[0],
        ],
        'Test_Recall': [
            logRegTestPerf.loc[0][1],
            logRegTestPerf_threshold_auc_roc.loc[0][1],
            logRegTestPerfThreshold_P_Recall.loc[0][1],
            loanTree_recall[1],  # index 1 is the test recall; index 0 (train) would report 1.000000 here
            loanTree_prePruned_recall[1],
            loanTree_grid_recall[1],
            ar_model_recall[1],
        ],
    }
)
comparison_frame
| | Model | Train_Recall | Test_Recall |
|---|---|---|---|
| 0 | Logistic Regression sklearn | 0.616314 | 0.557047 |
| 1 | Logistic Regression with / ROC AUC thres | 0.900302 | 0.865772 |
| 2 | Logistic Regression with / prec recall thres | 0.719033 | 0.644295 |
| 3 | Initial decision tree model | 1.000000 | 1.000000 |
| 4 | Decision tree with restricted maximum depth | 0.812689 | 0.751678 |
| 5 | Decision tree with hyper-parameter tuning | 0.927492 | 0.879195 |
| 6 | Decision tree with post-pruning | 0.945619 | 0.892617 |
Recommendation 1: Education level clearly affects whether a customer purchases a loan; it is the feature with the highest importance among the independent variables. I would target advertising and marketing at customers with graduate and advanced/professional education levels, and spend less on undergraduates.
Recommendation 2: Target ad and marketing campaigns at customers with higher incomes. Our modeling shows income as the second most important feature. The data analysis also shows that personal loan takers have higher incomes; over 100k is a level that stands out in the data.
Recommendation 3: Target ad and marketing campaigns at customers with 2 or more family members in the household. Our modeling shows family size as the third most important feature, and the data analysis shows that personal loan takers in the top 3 quartiles have 2 or more family members.
Recommendation 4: Target ad and marketing campaigns at customers who spend more than 2,500 dollars a month on their credit cards. As shown in the data analysis, personal loan customers in the top 3 quartiles spent more than 3k. This feature is also among the top 5 in our model.
Recommendation 5: Target ad and marketing campaigns at customers who have a CD account. The data analysis suggested this, and the model supports it: having a CD account is associated with higher uptake of personal loans.
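As one possible way to act on these recommendations, the class-1 leaves of the post-pruned tree printed earlier can be turned into a simple customer filter. The sketch below is illustrative only: the `customers` rows are made up, the `target` column is a hypothetical name, and the thresholds are copied from the tree rules rather than tuned for a campaign.

```python
import pandas as pd

# Hypothetical customers; column names follow the data dictionary.
customers = pd.DataFrame(
    {
        "Income": [150, 130, 120, 80, 100],
        "Education": [2, 1, 1, 3, 2],
        "Family": [1, 4, 2, 3, 1],
        "CCAvg": [1.0, 2.0, 0.5, 3.5, 3.2],
        "CD_Account": [0, 0, 0, 1, 0],
    }
)

high_income = customers["Income"] > 116.5
high_spend = customers["CCAvg"] > 2.95

# Rules read off the class-1 leaves of the post-pruned tree.
rule_a = high_income & (customers["Education"] > 1.5)
rule_b = high_income & (customers["Education"] <= 1.5) & (customers["Family"] > 2.5)
rule_c = (customers["Income"] <= 92.5) & high_spend & (customers["CD_Account"] > 0.5)
rule_d = (
    (customers["Income"] > 92.5)
    & (customers["Income"] <= 116.5)
    & high_spend
    & (customers["Education"] > 1.5)
)

customers["target"] = rule_a | rule_b | rule_c | rule_d
print(customers)
```

The first two rules correspond to the purest leaves in the tree (222 and 47 training customers, all loan takers), so they are the most natural starting point for a targeted campaign.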